Commit Graph

238 Commits

Author SHA1 Message Date
sychen
4cb4701ede
[CELEBORN-689] Fix the incorrect part of PushDataHandler message type converted to status code
### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1601 from cxzl25/CELEBORN-689.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-19 11:25:08 +08:00
Angerszhuuuu
0aa13832b5 [CELEBORN-676] Celeborn fetch chunk also should support check timeout
### What changes were proposed in this pull request?
Celeborn fetch chunk also should support check timeout

#### Test case
```
executor instance 20

SQL:
SELECT count(1) from (select /*+ REPARTITION(100) */ * from spark_auxiliary.t50g) tmp;

--conf spark.celeborn.client.spark.shuffle.writer=sort \
--conf spark.celeborn.client.fetch.excludeWorkerOnFailure.enabled=true \
--conf spark.celeborn.client.push.timeout=10s \
--conf spark.celeborn.client.push.replicate.enabled=true \
--conf spark.celeborn.client.push.revive.maxRetries=10 \
--conf spark.celeborn.client.reserveSlots.maxRetries=10 \
--conf spark.celeborn.client.registerShuffle.maxRetries=3 \
--conf spark.celeborn.client.push.blacklist.enabled=true \
--conf spark.celeborn.client.blacklistSlave.enabled=true \
--conf spark.celeborn.client.fetch.timeout=30s \
--conf spark.celeborn.client.push.data.timeout=30s \
--conf spark.celeborn.client.push.limit.inFlight.timeout=600s \
--conf spark.celeborn.client.push.maxReqsInFlight=32 \
--conf spark.celeborn.client.shuffle.compression.codec=ZSTD \
--conf spark.celeborn.rpc.askTimeout=30s \
--conf spark.celeborn.client.rpc.reserveSlots.askTimeout=30s \
--conf spark.celeborn.client.shuffle.batchHandleChangePartition.enabled=true \
--conf spark.celeborn.client.shuffle.batchHandleCommitPartition.enabled=true \
--conf spark.celeborn.client.shuffle.batchHandleReleasePartition.enabled=true
```

Test with 3 worker and add a `Thread.sleep(100s)` before worker handle `ChunkFetchRequest`

Before patch
<img width="1783" alt="截屏2023-06-14 上午11 20 55" src="https://github.com/apache/incubator-celeborn/assets/46485123/182dff7d-a057-4077-8368-d1552104d206">

After patch
<img width="1792" alt="image" src="https://github.com/apache/incubator-celeborn/assets/46485123/3c8b7933-8ace-426d-8e9f-04e0aabfac8e">

The log shows the fetch timeout checker workers
```
23/06/14 11:14:54 ERROR WorkerPartitionReader: Fetch chunk 0 failed.
org.apache.celeborn.common.exception.CelebornIOException: FETCH_DATA_TIMEOUT
	at org.apache.celeborn.common.network.client.TransportResponseHandler.failExpiredFetchRequest(TransportResponseHandler.java:147)
	at org.apache.celeborn.common.network.client.TransportResponseHandler.lambda$new$1(TransportResponseHandler.java:103)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
23/06/14 11:14:54 WARN RssInputStream: Fetch chunk failed 1/6 times for location PartitionLocation[
  id-epoch:35-0
  host-rpcPort-pushPort-fetchPort-replicatePort:10.169.48.203-9092-9094-9093-9095
  mode:MASTER
  peer:(host-rpcPort-pushPort-fetchPort-replicatePort:10.169.48.202-9092-9094-9093-9095)
  storage hint:StorageInfo{type=HDD, mountPoint='/mnt/ssd/0', finalResult=true, filePath=}
  mapIdBitMap:null], change to peer
org.apache.celeborn.common.exception.CelebornIOException: Fetch chunk 0 failed.
	at org.apache.celeborn.client.read.WorkerPartitionReader$1.onFailure(WorkerPartitionReader.java:98)
	at org.apache.celeborn.common.network.client.TransportResponseHandler.failExpiredFetchRequest(TransportResponseHandler.java:146)
	at org.apache.celeborn.common.network.client.TransportResponseHandler.lambda$new$1(TransportResponseHandler.java:103)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.celeborn.common.exception.CelebornIOException: FETCH_DATA_TIMEOUT
	at org.apache.celeborn.common.network.client.TransportResponseHandler.failExpiredFetchRequest(TransportResponseHandler.java:147)
	... 8 more
23/06/14 11:14:54 INFO SortBasedShuffleWriter: Memory used 72.0 MB
```

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1587 from AngersZhuuuu/CELEBORN-676.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-06-15 13:54:09 +08:00
Shuang
da85347330 [CELEBORN-675] Fix decode heartbeat message
### What changes were proposed in this pull request?
Give Heartbeat one byte message and skip this byte when decode.

### Why are the changes needed?
Heartbeat message may split in to two netty buffer,  then the `empty buffer` (which don't need actually, but need keep) be wrong removed, then decodeNext would throw NPE. see
```  java
while (headerBuf.readableBytes() < HEADER_SIZE) {
      ByteBuf next = buffers.getFirst();
      int toRead = Math.min(next.readableBytes(), HEADER_SIZE - headerBuf.readableBytes());
      headerBuf.writeBytes(next, toRead);
      if (!next.isReadable()) {
        buffers.removeFirst().release();
      }
    }
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
UT & MANUAL

Closes #1589 from RexXiong/CELEBORN-675.

Authored-by: Shuang <lvshuang.tb@gmail.com>
Signed-off-by: zhongqiang.czq <zhongqiang.czq@alibaba-inc.com>
2023-06-14 14:37:13 +08:00
Angerszhuuuu
f2357bf75c [CELEBORN-671] Add hasPeer method to PartitionLocation
### What changes were proposed in this pull request?
Add hasPeer method to PartitionLocation

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1583 from AngersZhuuuu/CELEBORN-671.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-06-14 10:29:16 +08:00
zky.zhoukeyong
76831e805d [CELEBORN-668] Report WorkerLost instead of WorkerUnavailable if grac…
…eful is disabled

### What changes were proposed in this pull request?
Worker should report WorkerLost instead of WorkerUnavailable in it's shutdown hook if graceful shutdown is disabled.

### Why are the changes needed?
To avoid unnecessary commit file requests from lifecycle manager since it's not graceful shutdown.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

Closes #1580 from waitinfuture/668.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-13 11:30:59 +08:00
zky.zhoukeyong
dab865b68b [CELEBORN-662] Report worker unavailable regardless graceful shutdown
### What changes were proposed in this pull request?
In this PR, worker always report node unavailable regardless graceful shutdown is turned on or off.

### Why are the changes needed?
To inform master the shutting down worker as soon as possible.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual test.

Closes #1575 from waitinfuture/662.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-10 18:36:25 +08:00
Cheng Pan
76533d7324
[CELEBORN-650][TEST] Upgrade scalatest and unify mockito version
### What changes were proposed in this pull request?

This PR upgrades

- `mockito` from 1.10.19 and 3.6.0 to 4.11.0
- `scalatest` from 3.2.3 to 3.2.16
- `mockito-scalatest` from 1.16.37 to 1.17.14

### Why are the changes needed?

Housekeeping, making test dependencies up-to-date and unified.

### Does this PR introduce _any_ user-facing change?

No, it only affects test.

### How was this patch tested?

Pass GA.

Closes #1562 from pan3793/CELEBORN-650.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-09 10:04:14 +08:00
onebox-li
0c869ac9a0
[CELEBORN-642] Improve metrics and update grafana
### What changes were proposed in this pull request?
Change in grafana

(ALL)
add:
JVMCPUTime
LastMinuteSystemLoad
AvailableProcessors
(For Master)
add:
LostWorkers
IsActiveMaster
PartitionSize
(For Worker)
add:
PushDataFailCount -> WriteDataFailCount
ReplicateDataFailCount
ReplicateDataWriteFailCount
ReplicateDataCreateConnectionFailCount
ReplicateDataConnectionExceptionCount
ReplicateDataTimeoutCount
SortedFileSize
PushDataHandshakeFailCount
RegionStartFailCount
RegionFinishFailCount
MasterPushDataHandshakeTime
SlavePushDataHandshakeTime
MasterRegionStartTime
SlaveRegionStartTime
MasterRegionFinishTime
SlaveRegionFinishTime
PotentialConsumeSpeed
UserProduceSpeed
WorkerConsumeSpeed
DeviceOSFreeBytes
DeviceCelebornFreeBytes
push usedHeapMemory/usedDirectMemory
fetch usedHeapMemory/usedDirectMemory
replicate usedHeapMemory/usedDirectMemory
remove:
dup ReserveSlotsTime

Change dashboard layout.

Fix support for multiple labels.

Modify some metrics docs.

### Why are the changes needed?
For better use of metrics.

### Does this PR introduce _any_ user-facing change?
Below metrics change name, extract some value to the label.
DeviceOSFreeCapacity(B) -> DeviceOSFreeBytes
DeviceOSTotalCapacity(B) -> DeviceOSTotalBytes
DeviceCelebornFreeCapacity(B) -> DeviceCelebornFreeBytes
DeviceCelebornTotalCapacity(B) -> DeviceCelebornTotalBytes
push usedHeapMemory/usedDirectMemory
fetch usedHeapMemory/usedDirectMemory
replicate usedHeapMemory/usedDirectMemory

### How was this patch tested?
Cluster test.

Closes #1557 from onebox-li/improve-metrics.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-08 18:10:06 +08:00
zhongqiang.czq
586785c88d [CELEBORN-617][FLINK] MapPartitionFileWriter updates flushing file length
…ngth

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1519 from zhongqiangczq/mapfilelength.

Authored-by: zhongqiang.czq <zhongqiang.czq@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-08 10:47:36 +08:00
Angerszhuuuu
d4cb6dd8ab [CELEBORN-645][REFACTOR] Refine logic about handle HeartbeatFromWorkerResponse
### What changes were proposed in this pull request?
Refine the logic here to make it easier understand.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1555 from AngersZhuuuu/CELEBORN-645.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-06-07 16:34:44 +08:00
Cheng Pan
3c7d179e05
[CELEBORN-636] Replace SimpleDateFormat with FastDateFormat
### What changes were proposed in this pull request?

`SimpleDateFormat` is not thread-safe, replace it with a thread-safe `FastDateFormat`

### Why are the changes needed?

`FastDateFormat` is a fast and thread-safe version of `java.text.SimpleDateFormat`.

### Does this PR introduce _any_ user-facing change?

Yes, it's a bug fix.

### How was this patch tested?

Manually review.

Closes #1545 from pan3793/CELEBORN-636.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Ethan Feng <ethanfeng@apache.org>
2023-06-06 12:59:32 +08:00
Shuang
a1de16a80f
[CELEBORN-626] Fix potential deadlock in filewriter
### What changes were proposed in this pull request?
Lock flushBuffer field and flush method to make sure thread safe access.

### Why are the changes needed?
When stageEnd, worker will commit files and filewriters would be closed, the speculative task may still push data to the file writer, if the push task increment numPendingWrites. the commit thread which hold the filewriter object lock will need wait the pending writes decrement to 0. but push thread need the filewriter object lock to  decrement numPendingWrites, this cause deadlock..

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
UT

Closes #1534 from RexXiong/CELEBORN-626.

Authored-by: Shuang <lvshuang.tb@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-02 17:47:39 +08:00
Angerszhuuuu
e18a5ea769
[CELEBORN-624] StorageManager should only remove expired app dirs (#1531) 2023-06-02 11:33:33 +08:00
Angerszhuuuu
cf308aa057
[CLEBORN-595] Refine code frame of CelebornConf (#1525) 2023-06-01 10:37:58 +08:00
Angerszhuuuu
62681ba85d
[CELEBORN-595] Rename and refactor the configuration doc. (#1501) 2023-05-30 15:14:12 +08:00
zhongqiangchen
f117cff776
[CELEBORN-618] [FLINK] worker side adds partition split configuration options (#1520) 2023-05-30 14:13:31 +08:00
Angerszhuuuu
c4bff654b0
[CELEBORN-614] Simplify StorageManager's flushFileWriters to avoid too much cost on collection operation (#1517) 2023-05-30 11:38:05 +08:00
Angerszhuuuu
6619015a63
[CELEBORN-596] Worker don't need to update disk max slots (#1502) 2023-05-23 10:30:35 +08:00
Ethan Feng
7015d2463a
[CELEBORN-583] Merge pooled memory allocators. (#1490) 2023-05-18 10:37:30 +08:00
Leo Li
65cdb3eba4
[CELEBORN-585] Create if not exists worker recoverPath when graceful shutdown is enabled (#1487) 2023-05-17 11:29:09 +08:00
Angerszhuuuu
64a3534f71
[CELEBORN-584] Worker side should expose push/replicate/fetch Netty allocator metrics (#1489) 2023-05-16 17:51:33 +08:00
Angerszhuuuu
d657f8268a
[CELEBORN-586] Add SystemMiscSource to indicate system running status (#1488) 2023-05-16 14:03:07 +08:00
zhongqiangchen
5769c3fdc7
[CELEBORN-552] Add HeartBeat between the client and worker to keep alive (#1457) 2023-05-10 19:35:51 +08:00
Shuang
fb753fd48e
[CELEBORN-573] Guarantee resource/app/worker change persistent to raft in Ha Mode. (#1477) 2023-05-10 14:28:52 +08:00
Angerszhuuuu
5f7e1ce8e2
[CELEBORN-578][REFACTOR] Refine commit file's log to indicate more clear about empty partitions (#1481) 2023-05-08 18:21:46 +08:00
Angerszhuuuu
7a4f2ebd8a
[CELEBORN-547] Refactor request related API (#1452) 2023-04-27 16:25:41 +08:00
Angerszhuuuu
be84e8ba0d
[CELEBORN-562][REFACTOR] Rename Destroy and DestroyResponse to make it more clear (#1467) 2023-04-27 12:31:32 +08:00
Shuang
0b2e4877bd
[CELEBORN-553] Improve IO (#1458) 2023-04-25 21:14:06 +08:00
Ethan Feng
01d8d1079c
[CELEBORN-550][FLINK] Fix bufferQueue release and poll concurrent problem. (#1455) 2023-04-25 15:06:06 +08:00
Shuang
d68deecaaa
[CELEBORN-546][FLINK] Use autoIncrement partitionId replace encode(mapId, attemptId) for generating partitionId (#1447) 2023-04-22 16:33:22 +08:00
Angerszhuuuu
0c2d3e647d
[CELEBORN-532][METRICS] Refine push-related failure metrics (#1442)
* [CELEBORN-532][METRICS] Refine push-related failure metrics
2023-04-21 17:05:43 +08:00
Angerszhuuuu
181c1bfcd6
[CELEBORN-524][PERF] CongestionControl call too much ChannelsLimiter onTrim cause CPU stuck or occupy too much CPU cause no cpu for handlePushData (#1428) 2023-04-21 15:44:56 +08:00
Ethan Feng
6378a386d0
[CELEBORN-530][REFACTOR] Move stream manager and memory manager to worker module. (#1439) 2023-04-20 10:17:26 +08:00
Angerszhuuuu
932ccd0841
[CELEBORN-523][REFACTOR] Remove unnecessary code in WorkerPartitionLocationInfo (#1427) 2023-04-15 22:36:48 +08:00
Shuang
a22c6ca749
[CELEBORN-521] correct exception and unify unRetryableException (#1425) 2023-04-15 22:27:28 +08:00
cxzl25
13f772e0c0
[CELEBORN-525] Fix wrong parameter celeborn.push.buffer.size 2023-04-14 20:45:25 +08:00
Rex(Hui) An
0b402b5903
[CELEBORN-522] Add worker consume speed metric 2023-04-14 13:38:49 +08:00
Angerszhuuuu
3a21362265
[CELEBORN-511][IMPROVE] Move onTrim tag to StorageManager to avoid frequent trim action (#1415)
* [CELEBORN-511][IMPROVE] Move onTrim tag to StorageManager to avoid frequent trim action
2023-04-14 10:35:51 +08:00
Ethan Feng
9cccfc9872
[CELEBORN-431][FLINK] Support dynamic buffer allocation in reading map partition. (#1407) 2023-04-13 10:37:47 +08:00
Angerszhuuuu
da98ed9bea
[CELEBORN-516][PERF] Remove RPCSource since it cost too much CPU (#1420) 2023-04-12 18:47:06 +08:00
Angerszhuuuu
f574a4dafa
[CELEBORN-512][IMPROVEMENT] Sort timestamp and show in date format (#1416) 2023-04-11 19:56:48 +08:00
Angerszhuuuu
cad2836e85
[CELEBORN-505] Fix typo of SHUFFLE_CHUCK_SIZE (#1411) 2023-04-04 19:15:30 +08:00
Fei Wang
b40c573069
[CELEBORN-474][FOLLOWUP] Using inner static ConcurrentHashMap class and only apply for JDK8 (#1384) 2023-03-27 16:16:23 +08:00
Fei Wang
7c444cb0c5
[CELEBORN-474] Speed up ConcurrentHashMap#computeIfAbsent (#1383) 2023-03-26 09:41:59 +08:00
Fei Wang
c609c0ebaa
[MINOR] Fix typo and remove unused code (#1381)
* fix typo

* remove unused
2023-03-25 23:20:33 +08:00
Shuang
89b3f3887d
[CELEBORN-356] [FLINK] Support release single partition resource (#1314) 2023-03-24 17:15:28 +08:00
Keyong Zhou
3d6fba553b
[CELEBORN-454] Code refine for worker (#1371) 2023-03-22 10:39:14 +08:00
Keyong Zhou
13c610838b
[CELEBORN-455] Use 4 bytes instead of 16 to read mapId in FileWriter.… (#1369) 2023-03-21 19:21:50 +08:00
Angerszhuuuu
56d796638f
[CELEBORN-438] Move ServletPath to MetricsSytsem (#1364) 2023-03-20 18:22:40 +08:00
乐活优格
0b78c6d325
[CELEBORN-442]Support hdfs compatible file system (#1360) 2023-03-18 11:47:46 +08:00
Angerszhuuuu
e61130d397
[CELEBORN-423][FOLLOWUP] Format http request (#1353)
* [CELEBORN-423][FOLLOWUP] Format http request
2023-03-15 16:30:23 +08:00
Angerszhuuuu
1f56a5e5d1
[CELEBORN-423] Format http request result (#1349) 2023-03-15 10:32:01 +08:00
Angerszhuuuu
3907d70212
[CELEBORN-421] Add shutdown and registered to http request (#1346)
* [CELEBORN-421] Add shutdown and registered to http request
2023-03-14 18:23:21 +08:00
Angerszhuuuu
7d7279a9bc
[CELEBORN-420] Add unavailablePeers to http request (#1345)
* [CELEBORN-420] Add unavailablePeers to http request
2023-03-14 17:23:45 +08:00
Angerszhuuuu
3600ccc4e3
[CELEBORN-409] Add PartitionLocationInfo to worker's http request (#1335) 2023-03-13 17:02:28 +08:00
Angerszhuuuu
6f1ab70403
[CELEBORN-406] Add blacklist to http request to indicate blacklisted worker (#1334) 2023-03-13 16:44:46 +08:00
Angerszhuuuu
144a8cdb3f
[CELEBORN-408] Add lost worker infos to http request (#1333) 2023-03-13 15:27:41 +08:00
Ethan Feng
bb8401e401
[CELEBORN-403][FLINK] Add metrics about buffer dispatcher request queue length. (#1329) 2023-03-13 11:15:00 +08:00
Angerszhuuuu
a336f12cc8
[CELEBORN-400] Add RPC metrics for OpenStream (#1326) 2023-03-10 21:22:05 +08:00
Angerszhuuuu
4b334df7a6
[CELEBORN-399] Make fileSorterExecutors thread num can be customized (#1325) 2023-03-10 21:10:43 +08:00
jiaoqingbo
84795bc63b
[CELEBORN-382] Call checkDiskFullAndSplit in the handlePushData method to avoid repeated definitions (#1313) 2023-03-07 18:55:46 +08:00
Ethan Feng
675a7da393
[CELEBORN-368][FLINK] Pass exceptions in buffer stream. (#1304) 2023-03-03 15:43:30 +08:00
Keyong Zhou
dcedf7b0a9
[CELEBORN-348] Support fetchTime in load-aware slots assignment strategy (#1287) 2023-03-02 18:31:50 +08:00
Angerszhuuuu
eda21ead24
[CELEBORN-344] Change PUSH_DATA_FAIL_MASTER/SALVE to PUSH_DATA_WRITE_FAIL_MASTER/SALVE (#1281) 2023-02-28 11:29:40 +08:00
Keyong Zhou
7adf1fca41
[CELEBORN-295] Optimize data push (#1232)
* [CELEBORN-295] Add double buffer for sort pusher
2023-02-28 10:35:55 +08:00
Angerszhuuuu
24f5478adc
[CELEBORN-338] Clean duplicated exception message of handling push data (#1274) 2023-02-28 10:35:18 +08:00
Rex(Hui) An
798ff90bb7
[CELEBORN-342] Fix the wrong avg produce bytes in Congestion control (#1279) 2023-02-27 16:29:37 +08:00
Keyong Zhou
3c8c58e09d
[CELEBORN-301] Refactor PartitionLocationInfo to use ConcurrentHashMap (#1278) 2023-02-26 16:46:30 +08:00
Angerszhuuuu
a7587c3fe7
[CELEBORN-337] Remove unnecessary StatusCode.message (#1272)
* [CELEBORN-337] Remove unnecessary StatusCode.message
2023-02-24 15:11:07 +08:00
Shuang
9754616d79
[CELEBORN-330] fix deadlock when use the same netty channel to receive data while other thread wait the response (#1265) 2023-02-23 17:57:43 +08:00
Angerszhuuuu
fc8540a2e6
[CELEBORN-325] After worker restart, throw NPE when receive not found partition (#1259)
* [CELEBORN-325] After worker restart, throw NPE when receive not found partition
2023-02-22 15:19:34 +08:00
Ethan Feng
0df08fbdf3
[CELEBORN-320][FLINK] fix handle wrong message type in FetchHandler. (#1254) 2023-02-21 11:51:01 +08:00
Ethan Feng
26a3bb5e72
[CELEBORN-308] Fix flusher will exit unexpectedly if flush task write failed. (#1249) 2023-02-20 21:45:37 +08:00
Ethan Feng
0c8bb83114
[CELEBORN-234] Implement buffer stream. (#1221) 2023-02-17 17:38:36 +08:00
zhongqiangchen
5236df68af
[CELEBORN-292] optimize mappartitionfilewriter flushing index and reading data header (#1225) 2023-02-17 13:42:28 +08:00
zhongqiangchen
79096d60d0
[CELEBORN-293] WorkerSource registers timer for mappartition message metrics (#1226) 2023-02-17 11:29:54 +08:00
Ethan Feng
1dcfdb0c8f
[CELEBORN-281] Add metrics about buffer stream read buffer. (#1216) 2023-02-17 11:20:07 +08:00
Angerszhuuuu
57f775a7e9
[CELEBORN-273] Move push data timeout checker into TransportResponseHandler to keep callback status consistence (#1208) 2023-02-16 18:27:37 +08:00
Ethan Feng
534853bf8a
[CELEBORN-278] Add openStreamWithCredit RPC. (#1214) 2023-02-16 14:07:13 +08:00
zhongqiangchen
2c508dae0f
[CELEBORN-307] fix ArrayComparisonFailure while running lz4 ut (#1241) 2023-02-16 13:41:17 +08:00
Rex(Hui) An
2068e6ae37
[CELEBORN-279] Add user level push data speed metric (#1213) 2023-02-13 12:04:44 +08:00
Rex(Hui) An
adb6592d31
[CELEBORN-277] PushDataHandle callback could miss soft split status (#1212) 2023-02-09 14:57:18 +08:00
Rex(Hui) An
f88f5fcf55
[CELEBORN-207][FOLLOW_UP] Master could miss the congestion status if enable push.data.replicate 2023-02-07 22:57:39 +08:00
Rex(Hui) An
cfe81969c9
[CELEBORN-275] WrappedCallback should only handle response from replica (#1209) 2023-02-07 18:18:13 +08:00
Rex(Hui) An
bb113ec9be
[CELEBORN-207] Support network congestion control (#1066) 2023-02-07 12:06:18 +08:00
Angerszhuuuu
c4020100db
[CELEBORN-271][BUG] PushState in PushDataHandler should should use peer's location 2023-02-06 11:31:57 +08:00
Angerszhuuuu
ecc3a0e52f
[CELEBORN-272][BUG] Don't do replication should directly use callback not wrappedCallback (#1205) 2023-02-06 11:28:12 +08:00
zhongqiangchen
8e903840af [CELEBORN-243][REWORK]fix bug that os's disk usage is low but celeborn thinks that it's high_disk_usage (#1202) 2023-02-04 14:27:44 +08:00
Angerszhuuuu
2e68912812
[CELEBORN-269][BUG] Disable replication throw NPE when removeBatch in pushDataHandler (#1203) 2023-02-03 20:06:59 +08:00
Shuang
2634476758
[CELEBORN-267] reuse stream when client channel reconnected (#1200) 2023-02-03 15:12:45 +08:00
Angerszhuuuu
4b6f7e4593
[CELEBORN-239][IMPROVEMENT] Worker replicate should enable push data timeout too (#1185) 2023-02-03 11:53:15 +08:00
zhongqiangczq
ff17a61ec5
[CELEBORN-243] fix bug that os's disk usage is low but celeborn thinks that it's high_disk_usage (#1184) 2023-02-02 10:41:11 +08:00
Shuang
7162be2fae
[CELEBORN-201] Separate partitionLocationInfo in LifecycleManager and worker (#1149) 2023-01-31 18:53:36 +08:00
Angerszhuuuu
1311fb53d1
[CELEBORN-243][CELEBORN-245][IMPROVEMENT] Create push client failed and connection failed cause push failed should have their own ERROR type (#1181)
* [CELEBORN-243][IMPROVEMENT] Create push client failed should have a ERROR type
2023-01-30 17:47:22 +08:00
Angerszhuuuu
8611a64400
[CELEBORN-237][IMPROVEMENT] push failed error message should show partition info (#1178)
* [CELEBORN-237][IMPROVEMENT] push failed error message should show partition info
2023-01-28 18:41:54 +08:00
Ethan Feng
a239f9f284
[CELEBORN-228]Refactor PartitionFileSorter to avoid specific JDK dependency. (#1168) 2023-01-16 20:06:47 +08:00
zy.jordan
bb96700415
[CELEBORN-223] The default rpc thread num of pushServer/replicateServer/fetchServer should be the number of total of Flusher's thread (#1163) 2023-01-16 12:03:46 +08:00
zhongqiangczq
3661222d98
[CELEBORN-195] add implementation to MapPartitionFileWriter (#1141) 2023-01-13 16:41:11 +08:00
zy.jordan
19197b9190
[CELEBORN-214] Push/Replicate/Fetch io threads default value is 16 (#1158) 2023-01-10 17:46:56 +08:00
nafiy
9635725480
[CELEBORN-204][IMPROVEMENT]Collect disk usage metrics in byte unit by default (#1153) 2023-01-09 17:36:18 +08:00