Commit Graph

83 Commits

Author SHA1 Message Date
zwangsheng
6c2fdf7477
[CELEBORN-1188][TEST] Using JUnit function instead of java assert
### What changes were proposed in this pull request?
Use JUnit assertion functions instead of the Java `assert` statement.

### Why are the changes needed?
When a Java `assert` fails, it throws an `AssertionError`, which makes it hard to see how the actual value differs from the expected one.

![Screenshot 2023-12-20 10 34 52](https://github.com/apache/incubator-celeborn/assets/52876270/b36421a5-64e1-4717-a6d4-3b08db403293)

In contrast, when we use JUnit assertions, the difference is clearly reported.

![Screenshot 2023-12-20 11 17 21](https://github.com/apache/incubator-celeborn/assets/52876270/ce39fa20-e9ab-4419-a4ca-62c4157e4b2c)

### Does this PR introduce _any_ user-facing change?
No, only tests are changed.

### How was this patch tested?
Run CI

Closes #2173 from zwangsheng/CELEBORN-1188.

Authored-by: zwangsheng <binjieyang@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-12-20 21:20:38 +08:00
sychen
7f653ce7d6 [CELEBORN-1190] Apply error prone patch and suppress some problems
### What changes were proposed in this pull request?
1. Fix violations of the MissingOverride, DefaultCharset, and UnnecessaryParentheses rules
2. Exclude generated sources and the FutureReturnValueIgnored, TypeParameterUnusedInFormals, and UnusedVariable checks

### Why are the changes needed?
```
./build/make-distribution.sh --release
```
Running this command currently produces a lot of WARNINGs.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
GA

Closes #2177 from cxzl25/error_prone_patch.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
2023-12-20 20:54:18 +08:00
zky.zhoukeyong
01feb93abb [CELEBORN-1167] Avoid calling parmap when destroy slots
### What changes were proposed in this pull request?
As title

### Why are the changes needed?
One user reported that LifecycleManager's parmap can create a huge number of threads and cause OOM.

![image](https://github.com/apache/incubator-celeborn/assets/948245/1e9a0b83-32fe-40d5-8739-2b370e030fc8)

There are four places where parmap is called:

1. When LifecycleManager commits files
2. When LifecycleManager reserves slots
3. When LifecycleManager sets up connections to workers
4. When LifecycleManager destroys slots

This PR fixes the fourth one. More specifically, it eliminates `parmap` when destroying slots and also replaces `askSync` with `ask`.
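
The following is a minimal sketch of the general idea, not the code from this PR. It assumes a hypothetical non-blocking `ask` that returns a `CompletableFuture`; the class name, method names, and worker strings are illustrative only. All requests are issued first and the caller waits once on the combined result, so no thread is created per worker.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

final class DestroySlotsSketch {
  // Hypothetical non-blocking RPC: returns a future immediately.
  // A real RPC layer would complete the future from its response callback.
  static CompletableFuture<String> ask(String worker) {
    return CompletableFuture.completedFuture("destroyed@" + worker);
  }

  // Issue every request up front, then wait once on the combined future.
  // This avoids a parallel map of blocking calls and its per-item threads.
  static List<String> destroySlots(List<String> workers) {
    List<CompletableFuture<String>> futures =
        workers.stream().map(DestroySlotsSketch::ask).collect(Collectors.toList());
    CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
    return futures.stream().map(CompletableFuture::join).collect(Collectors.toList());
  }

  public static void main(String[] args) {
    System.out.println(destroySlots(Arrays.asList("worker-1", "worker-2")));
  }
}
```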

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual test and GA.

Closes #2156 from waitinfuture/1167.

Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Co-authored-by: cxzl25 <cxzl25@users.noreply.github.com>
Co-authored-by: Keyong Zhou <waitinfuture@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-15 18:30:31 +08:00
Erik.fang
aee41555c6 [CELEBORN-955] Re-run Spark Stage for Celeborn Shuffle Fetch Failure
### What changes were proposed in this pull request?
Currently, Celeborn uses replication to handle shuffle data loss for the Celeborn shuffle reader. This PR implements an alternative solution based on Spark stage resubmission.

Design doc:
https://docs.google.com/document/d/1dkG6fww3g99VAb1wkphNlUES_MpngVPNg8601chmVp8/edit

### Why are the changes needed?
Spark stage resubmission uses fewer resources than replication, and some Celeborn users have also asked for it.

### Does this PR introduce _any_ user-facing change?
A new config `celeborn.client.fetch.throwsFetchFailure` is introduced to enable this feature.

### How was this patch tested?
Two UTs are included, and we also tested it in Ant Group's dev Spark cluster.

Closes #1924 from ErikFang/Re-run-Spark-Stage-for-Celeborn-Shuffle-Fetch-Failure.

Lead-authored-by: Erik.fang <fmerik@gmail.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-11-26 16:47:58 +08:00
liangyongyuan
69e14fd341 [CELEBORN-1128] Fix incorrect method reference in ConcurrentHashMap.contains
### What changes were proposed in this pull request?
`ConcurrentHashMap.contains` means `containsValue`, not `containsKey`. The current codebase misuses the `contains` method of `ConcurrentHashMap`.

### Why are the changes needed?
To fix the misuse of `ConcurrentHashMap.contains`.
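
A small illustration of the pitfall, using only the JDK. The class name and the map contents are made up for the example; the behavior of `contains`, `containsValue`, and `containsKey` is that of `java.util.concurrent.ConcurrentHashMap`.

```java
import java.util.concurrent.ConcurrentHashMap;

final class ContainsSketch {
  // Illustrative data only: one key mapped to one value.
  static ConcurrentHashMap<String, String> sample() {
    ConcurrentHashMap<String, String> map = new ConcurrentHashMap<>();
    map.put("shuffle-1", "worker-a");
    return map;
  }

  public static void main(String[] args) {
    ConcurrentHashMap<String, String> map = sample();
    // contains(Object) is a legacy method that checks VALUES, like containsValue.
    System.out.println(map.contains("shuffle-1"));    // false: "shuffle-1" is a key
    System.out.println(map.contains("worker-a"));     // true: "worker-a" is a value
    System.out.println(map.containsKey("shuffle-1")); // true: the intended key lookup
  }
}
```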

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
No

Closes #2102 from lyy-pineapple/hashMap.

Authored-by: liangyongyuan <liangyongyuan@xiaomi.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-11-15 19:48:39 +08:00
TongWei1105
0583cdb5a8 [CELEBORN-1048] Align fetchWaitTime metrics to spark implementation
### What changes were proposed in this pull request?
Align fetchWaitTime metrics to spark implementation

### Why are the changes needed?
In our production environment, there are variations in the fetchWaitTime metric for the same stage of the same job.

On YARN ESS:
![image](https://github.com/apache/incubator-celeborn/assets/68682646/601a8315-1317-48dc-b9a6-7ea651d5122d)

On Celeborn:
![image](https://github.com/apache/incubator-celeborn/assets/68682646/e00ed60f-3789-4330-a7ed-fdd5754acf1d)

Then, based on the implementation of Spark's `ShuffleBlockFetcherIterator`, I adjusted the fetchWaitTime metrics code.

Now the metric looks more reasonable:
![image](https://github.com/apache/incubator-celeborn/assets/68682646/ce5e46e4-8ed2-422e-b54b-cd594aad73dd)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Yes, tested in our production environment.

Closes #2000 from TongWei1105/CELEBORN-1048.

Lead-authored-by: TongWei1105 <vvtwow@gmail.com>
Co-authored-by: Keyong Zhou <zhouky@apache.org>
Co-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-11-02 15:27:30 +08:00
SteNicholas
3092644168 [CELEBORN-1095] Support configuration of fastest available XXHashFactory instance for checksum of Lz4Decompressor
### What changes were proposed in this pull request?

`CelebornConf` adds `celeborn.client.shuffle.decompression.lz4.xxhash.instance` to configure the fastest available `XXHashFactory` instance for the checksum of `Lz4Decompressor`. Fixes #2043.

### Why are the changes needed?

`Lz4Decompressor` creates the checksum with `XXHashFactory#fastestInstance`, which returns the fastest available `XXHashFactory` instance and prefers `nativeInstance` by default. The `XXHashFactory` instance used for the checksum of `Lz4Decompressor` should be configurable instead of depending on whether the class loader is the system class loader. The method is as follows:
```
/**
 * Returns the fastest available {@link XXHashFactory} instance. If the class
 * loader is the system class loader and if the
 * {@link #nativeInstance() native instance} loads successfully, then the
 * {@link #nativeInstance() native instance} is returned, otherwise the
 * {@link #fastestJavaInstance() fastest Java instance} is returned.
 * <p>
 * Please read {@link #nativeInstance() javadocs of nativeInstance()} before
 * using this method.
 *
 * @return the fastest available {@link XXHashFactory} instance.
 */
public static XXHashFactory fastestInstance() {
  if (Native.isLoaded()
      || Native.class.getClassLoader() == ClassLoader.getSystemClassLoader()) {
    try {
      return nativeInstance();
    } catch (Throwable t) {
      return fastestJavaInstance();
    }
  } else {
    return fastestJavaInstance();
  }
}
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- `CelebornConfSuite`
- `ConfigurationSuite`

Closes #2050 from SteNicholas/CELEBORN-1095.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: xiyu.zk <xiyu.zk@alibaba-inc.com>
2023-10-31 14:57:31 +08:00
SteNicholas
49ea881037
[MINOR] Remove unnecessary increment index of Master#timeoutDeadWorkers
### What changes were proposed in this pull request?

Remove the unnecessary index increment in `Master#timeoutDeadWorkers`.

### Why are the changes needed?

The index increment in `Master#timeoutDeadWorkers` is unnecessary.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2027 from SteNicholas/timeout-dead-workers.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-10-23 22:18:39 +08:00
Fu Chen
b2412d0774 [CELEBORN-1022][TEST] Update log level from FATAL to ERROR for console output in unit tests
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

1. This is developer-friendly for debugging unit tests in IntelliJ IDEA. For example, Netty's memory leak reports are logged at the ERROR level and won't cause unit tests to be marked as fatal.

```
23/10/09 09:57:26,422 ERROR [fetch-server-52-2] ResourceLeakDetector: LEAK: ByteBuf.release() was not called before it's garbage-collected. See https://netty.io/wiki/reference-counted-objects.html for more information.
Recent access records:
Created at:
	io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:403)
	io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:188)
	io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:179)
	io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:140)
	io.netty.channel.DefaultMaxMessagesRecvByteBufAllocator$MaxMessageHandle.allocate(DefaultMaxMessagesRecvByteBufAllocator.java:120)
	io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:150)
	io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
	io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
	io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
	io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
	io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	java.lang.Thread.run(Thread.java:750)
```

2. This won't increase console output or affect the stability of CI.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

Closes #1958 from cfmcgrady/ut-console-log-level.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-09 15:56:05 +08:00
Fu Chen
39ab731b85 [CELEBORN-875][FOLLOWUP] Enhance DataPushQueueSuiteJ for thread safety and prevent NullPointerException
### What changes were proposed in this pull request?

1. Replace `HashMap` with `ConcurrentHashMap` for `partitionBatchIdMap` to ensure thread safety during parallel data processing.
2. Put the partition id and batch id into `partitionBatchIdMap` before adding the task, to prevent a possible NPE.
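
The ordering in item 2 can be sketched as follows. This is an illustrative reduction, not the test code itself: the class, the `submit` and `consume` helpers, and the use of a `LinkedBlockingQueue` are assumptions made for the example. The point is that the map entry is published before the task becomes visible to the consumer thread, so the lookup never returns `null`.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

final class RegisterBeforeEnqueueSketch {
  // Shared between producer and consumer threads, so it must be thread-safe.
  static final Map<Integer, Integer> partitionBatchIdMap = new ConcurrentHashMap<>();
  static final BlockingQueue<Integer> tasks = new LinkedBlockingQueue<>();

  static void submit(int partitionId, int batchId) {
    // Publish the map entry first, then make the task visible.
    partitionBatchIdMap.put(partitionId, batchId);
    tasks.add(partitionId);
  }

  static int consume() {
    try {
      int partitionId = tasks.take();
      // Never null here: the entry was written before the task was enqueued.
      // With the opposite order, unboxing a null Integer would throw an NPE.
      return partitionBatchIdMap.get(partitionId);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) throws InterruptedException {
    Thread consumer = new Thread(() -> {
      for (int i = 0; i < 100; i++) {
        consume();
      }
    }, "consumer");
    consumer.start();
    for (int i = 0; i < 100; i++) {
      submit(i, i * 10);
    }
    consumer.join();
    System.out.println("consumed 100 tasks without NPE");
  }
}
```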

### Why are the changes needed?

To fix the following NPE:

https://github.com/apache/incubator-celeborn/actions/runs/5734532048/job/15540863715?pr=1785

```
Exception in thread "DataPusher-0" java.lang.NullPointerException
	at org.apache.celeborn.client.write.DataPushQueueSuiteJ$1.pushData(DataPushQueueSuiteJ.java:121)
	at org.apache.celeborn.client.write.DataPusher$1.run(DataPusher.java:125)
Error: The operation was canceled.
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

Closes #1789 from cfmcgrady/celeborn-875-followup.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-02 21:52:53 +08:00
Fu Chen
f869ab25b6 [CELEBORN-857][TEST] Refine DataPushQueueSuiteJ
### What changes were proposed in this pull request?

1. This PR proposes renaming the class `DataPushQueueSuitJ` to `DataPushQueueSuiteJ` so that it is included in the test run. This change is required to comply with our maven-surefire-plugin rule:

5f0295e9f3/pom.xml (L543-L551)

2. Fix a potential logic bug in the test: tasks within `DataPushQueue` may inadvertently be consumed by the `DataPusher`'s built-in thread `DataPusher-${taskId}`, leading to test suite failures.

![Screenshot 2023-07-31 12 08 06 PM](https://github.com/apache/incubator-celeborn/assets/8537877/b7a294a5-a12b-474a-b43d-233998bc7f31)

![Screenshot 2023-07-31 12 07 49 PM](https://github.com/apache/incubator-celeborn/assets/8537877/c585ed00-0111-4aab-863a-e7984ed8a298)

### Why are the changes needed?

To fix a bug in `DataPushQueueSuiteJ`.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

Closes #1774 from cfmcgrady/refine-data-push-queue-suite.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-31 15:43:43 +08:00
Angerszhuuuu
be05ae37fe [CELEBORN-815] Remove unused ShuffleClient.readPartition
### What changes were proposed in this pull request?
As title.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Passes GA.

Closes #1739 from AngersZhuuuu/CELEBORN-815.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-20 20:49:29 +08:00
Angerszhuuuu
5471a6afe5
[CELEBORN-804] ShuffleClient should cleanup shuffle infos when trigger unregisterShuffle
### What changes were proposed in this pull request?

After discussion, we confirmed that `shuffleManager.unregisterShuffle()` is triggered by Spark on both the driver and the executors. In this PR:

  1. Add a shuffle client on both the driver side and the executor side in ShuffleManager.
  2. ShuffleClient calls cleanupShuffle() when `unregisterShuffle` is triggered.

This replaces https://github.com/apache/incubator-celeborn/pull/1719

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1726 from AngersZhuuuu/CELEBORN-804.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-07-19 20:50:18 +08:00
Cheng Pan
0db919403e Revert "[CELEBORN-798] Add heartbeat from client to LifecycleManager to clean…"
This reverts commit e56a8a8bed.
2023-07-19 15:08:45 +08:00
zky.zhoukeyong
e56a8a8bed [CELEBORN-798] Add heartbeat from client to LifecycleManager to clean…
…up client

### What changes were proposed in this pull request?
Add a heartbeat from the client to the lifecycle manager. In this PR the heartbeat request contains the local shuffle ids from
the client; the lifecycle manager checks them against its local set and returns the ids it doesn't know. Upon receiving the response,
the client calls ```unregisterShuffle``` for cleanup.
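
The exchange described above amounts to a set difference. Below is a minimal sketch under that reading; the class and method names are hypothetical and do not reflect Celeborn's actual message types or RPC calls, and removing ids from a local set stands in for calling `unregisterShuffle` per id.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

final class HeartbeatSketch {
  // Lifecycle manager side (hypothetical): ids the client holds that are not registered here.
  static Set<Integer> unknownShuffleIds(Set<Integer> clientIds, Set<Integer> registeredIds) {
    Set<Integer> unknown = new TreeSet<>(clientIds);
    unknown.removeAll(registeredIds);
    return unknown;
  }

  // Client side (hypothetical): drop local state for every id in the response.
  static void cleanup(Set<Integer> localIds, Set<Integer> unknownIds) {
    localIds.removeAll(unknownIds); // stands in for unregisterShuffle(id) per id
  }

  public static void main(String[] args) {
    Set<Integer> local = new TreeSet<>(Arrays.asList(1, 2, 3));
    Set<Integer> registered = new TreeSet<>(Arrays.asList(1, 2));
    Set<Integer> unknown = unknownShuffleIds(local, registered);
    cleanup(local, unknown);
    System.out.println(unknown); // [3]
    System.out.println(local);   // [1, 2]
  }
}
```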

### Why are the changes needed?
Before this PR, the client side ```unregisterShuffle``` was never called. When running TPCDS 3T with the Spark Thrift Server
without DRA, I found that the Executor's heap contained 1.6 million PartitionLocation objects (and StorageInfo):
![image](https://github.com/apache/incubator-celeborn/assets/948245/43658369-7763-4511-a5b0-9b3fbdf02005)

After this PR, the number of PartitionLocation objects decreases to 275 thousand:
![image](https://github.com/apache/incubator-celeborn/assets/948245/45f8f849-186d-4cad-83c8-64bd6d18debc)

This heartbeat can be extended in the future for other purposes, e.g. reporting the client's metrics.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Passes GA and manual test.

Closes #1719 from waitinfuture/798.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-17 18:14:10 +08:00
zky.zhoukeyong
10a1def512 [CELEBORN-802] Reuse DataPusher#idleQueue by pooling to avoid too many byte[] objects
### What changes were proposed in this pull request?
Reuse ```DataPusher#idleQueue``` by pooling in ```SendBufferPool``` to avoid too many ```byte[]```
objects in ```PushTask```.
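
A rough sketch of pooling in this spirit is shown below. It is not the `SendBufferPool` implementation: the class, its methods, and the use of plain `byte[]` elements in place of `PushTask` are assumptions for illustration. A released queue keeps its buffers, so a later `acquire` reuses them instead of allocating new `byte[]` objects.

```java
import java.util.ArrayDeque;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical pool; names are illustrative and not Celeborn's SendBufferPool API.
final class IdleQueuePoolSketch {
  private final ArrayDeque<LinkedBlockingQueue<byte[]>> pool = new ArrayDeque<>();
  private final int queueCapacity;
  private final int bufferSize;

  IdleQueuePoolSketch(int queueCapacity, int bufferSize) {
    this.queueCapacity = queueCapacity;
    this.bufferSize = bufferSize;
  }

  // Reuse a pooled queue if one exists; otherwise allocate the queue and its buffers once.
  synchronized LinkedBlockingQueue<byte[]> acquire() {
    LinkedBlockingQueue<byte[]> queue = pool.poll();
    if (queue == null) {
      queue = new LinkedBlockingQueue<>(queueCapacity);
      for (int i = 0; i < queueCapacity; i++) {
        queue.add(new byte[bufferSize]);
      }
    }
    return queue;
  }

  // Assumes the caller has already returned every buffer to the queue.
  synchronized void release(LinkedBlockingQueue<byte[]> queue) {
    pool.offer(queue);
  }

  public static void main(String[] args) {
    IdleQueuePoolSketch pool = new IdleQueuePoolSketch(4, 64 * 1024);
    LinkedBlockingQueue<byte[]> first = pool.acquire();
    pool.release(first);
    System.out.println(pool.acquire() == first); // true: the queue and its buffers are reused
  }
}
```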

### Why are the changes needed?
I tested 3T TPCDS. Before this PR, containers were killed because of OOM and GC took about 9.6h. For alive Executors, I dumped the memory and saw that the number of PushTask objects was about 20,000 and the number of ```64K``` byte[] objects was 23356, around 1.7G in total:
![image](https://github.com/apache/incubator-celeborn/assets/948245/7b4ee4fa-7860-4ddb-b862-181a91748092)

After this PR, no container is killed because of OOM and GC takes about 8.6h. I also dumped an Executor and found that the number
of PushTask objects is 3584 and the number of ```64K``` byte[] objects is 5783, around 361M in total:
![image](https://github.com/apache/incubator-celeborn/assets/948245/981e8f70-52f8-4bb1-9f67-9a8b4f398392)

Also, total execution time is ```3313.8s``` before this PR and ```3229.5s``` after it.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Passes GA and Manual test.

Closes #1722 from waitinfuture/802.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-17 16:35:14 +08:00
zky.zhoukeyong
4b3a47c9db [CELEBORN-799] Limit total inflight push requests
### What changes were proposed in this pull request?
As title.

### Why are the changes needed?
When the number of worker instances is very large, say 1000, the total memory consumed
by inflight requests before this PR is 64K * 1000 * ```celeborn.client.push.maxReqsInFlight(16)``` = 1G. This PR
limits total inflight push requests, as 0.2.1-incubating does.
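
The estimate above can be checked with a few lines of arithmetic. The helper below is only a worked calculation, assuming 64K means 64 * 1024 bytes per request; the class and method names are made up for the example.

```java
final class InflightMemorySketch {
  // Worst case: every worker has maxReqsInFlight requests outstanding at once.
  static long inflightBytes(long bufferBytes, long workers, long maxReqsInFlight) {
    return bufferBytes * workers * maxReqsInFlight;
  }

  public static void main(String[] args) {
    long bytes = inflightBytes(64L * 1024, 1000, 16);
    System.out.println(bytes);                 // 1048576000
    System.out.println(bytes / (1024 * 1024)); // 1000 (MiB), i.e. roughly 1G
  }
}
```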

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Passes GA and manual test.

Closes #1720 from waitinfuture/799.

Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-17 16:17:24 +08:00
Angerszhuuuu
693172d0bd [CELEBORN-751] Rename remain rss related class name and filenames etc
### What changes were proposed in this pull request?
Rename the remaining RSS-related class names, file names, etc.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1664 from AngersZhuuuu/CELEBORN-751.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-07-04 10:20:08 +08:00
xiyu.zk
381165d4e7
[CELEBORN-755] Support disable shuffle compression
### What changes were proposed in this pull request?
Support deciding through configuration whether to compress shuffle data.

### Why are the changes needed?
Currently, Celeborn compresses all shuffle data, but for example, the shuffle data of Gluten has already been compressed. In this case, no additional compression is required. Therefore, configuration needs to be provided for users to decide whether to use Celeborn’s compression according to the actual situation.

### Does this PR introduce _any_ user-facing change?
No.

Closes #1669 from kerwin-zk/celeborn-755.

Authored-by: xiyu.zk <xiyu.zk@alibaba-inc.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-07-01 00:03:50 +08:00
Fu Chen
adbd38a926
[CELEBORN-726][FOLLOWUP] Update data replication terminology from master/slave to primary/replica in the codebase
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

In order to distinguish it from the existing master/worker, refactor data replication terminology to 'primary/replica' for improved clarity and inclusivity in the codebase

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

existing tests

Closes #1639 from cfmcgrady/primary-replica.

Lead-authored-by: Fu Chen <cfmcgrady@gmail.com>
Co-authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-29 17:07:26 +08:00
Angerszhuuuu
3985a5cbd7 [CELEBORN-666][FOLLOWUP] Unify all blacklist related code and comment
### What changes were proposed in this pull request?
Unify all blacklist-related code and comments.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1638 from AngersZhuuuu/CELEBORN-666-FOLLOWUP.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-28 16:28:03 +08:00
zky.zhoukeyong
6b82ecdfa0 [CELEBORN-712] Make appUniqueId a member of ShuffleClientImpl and refactor code
### What changes were proposed in this pull request?
Make appUniqueId a member of ShuffleClientImpl and remove applicationId from RPC messages on the client side, so it won't cause compatibility issues.

### Why are the changes needed?
Currently the Celeborn client is bound to a single application id, so there's no need to pass applicationId around in many RPC messages on the client side.

### Does this PR introduce _any_ user-facing change?
In some logs the application id will not be printed, which should not be a problem.

### How was this patch tested?
UTs.

Closes #1621 from waitinfuture/appid.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-25 21:37:16 +08:00
Angerszhuuuu
c1c46398d5 [CELEBORN-682] Master and client should handle blacklist worker and shutting down worker separately
### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1594 from AngersZhuuuu/CELEBORN-682.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-16 18:29:03 +08:00
Cheng Pan
76533d7324
[CELEBORN-650][TEST] Upgrade scalatest and unify mockito version
### What changes were proposed in this pull request?

This PR upgrades

- `mockito` from 1.10.19 and 3.6.0 to 4.11.0
- `scalatest` from 3.2.3 to 3.2.16
- `mockito-scalatest` from 1.16.37 to 1.17.14

### Why are the changes needed?

Housekeeping, making test dependencies up-to-date and unified.

### Does this PR introduce _any_ user-facing change?

No, it only affects tests.

### How was this patch tested?

Pass GA.

Closes #1562 from pan3793/CELEBORN-650.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-09 10:04:14 +08:00
Angerszhuuuu
cf308aa057
[CLEBORN-595] Refine code frame of CelebornConf (#1525) 2023-06-01 10:37:58 +08:00
Angerszhuuuu
62681ba85d
[CELEBORN-595] Rename and refactor the configuration doc. (#1501) 2023-05-30 15:14:12 +08:00
Angerszhuuuu
a22c61e479
[CELEBORN-582] Celeborn should handle InterruptedException during kill task properly (#1486) 2023-05-17 18:17:41 +08:00
Shuang
343f1e62d2
[CELEBORN-537][FOLLOWUP] Fix blacklist potentially lost failure workers (#1449) 2023-04-23 10:16:21 +08:00
Shuang
d68deecaaa
[CELEBORN-546][FLINK] Use autoIncrement partitionId replace encode(mapId, attemptId) for generating partitionId (#1447) 2023-04-22 16:33:22 +08:00
Shuang
62d60de8c5
[CELEBORN-537] Improve blacklist compute & minor fix for Flink (#1441)
[CELEBORN-537] improve blacklist compute & minor fix for flink
2023-04-20 18:30:10 +08:00
Ethan Feng
6378a386d0
[CELEBORN-530][REFACTOR] Move stream manager and memory manager to worker module. (#1439) 2023-04-20 10:17:26 +08:00
cxzl25
13f772e0c0
[CELEBORN-525] Fix wrong parameter celeborn.push.buffer.size 2023-04-14 20:45:25 +08:00
Keyong Zhou
cb19ed1c66
[CELEBORN-479][PERF] Refactor DataPushQueue.takePushTask to avoid busy wait (#1386) 2023-03-27 16:18:55 +08:00
Fei Wang
7c444cb0c5
[CELEBORN-474] Speed up ConcurrentHashMap#computeIfAbsent (#1383) 2023-03-26 09:41:59 +08:00
Shuang
89b3f3887d
[CELEBORN-356] [FLINK] Support release single partition resource (#1314) 2023-03-24 17:15:28 +08:00
Keyong Zhou
107868d4f1
[CELEBORN-441][FLINK] Move ShuffleTaskInfo to Flink Plugin (#1361) 2023-03-20 13:31:53 +08:00
zhongqiangchen
9dc1bc2b1c
[CELEBORN-367] [FLINK] Move pushdata functions used by mappartition from ShuffleClientImpl to FlinkShuffleClientImpl (#1295) 2023-03-02 18:50:38 +08:00
Keyong Zhou
7adf1fca41
[CELEBORN-295] Optimize data push (#1232)
* [CELEBORN-295] Add double buffer for sort pusher
2023-02-28 10:35:55 +08:00
Shuang
61065230bd
[CELEBORN-311] not retry when register for map partition occurs exception (#1246) 2023-02-21 10:16:10 +08:00
zhongqiangchen
b5dc106af8
[CELEBORN-291] optimize shuffleclientimpl creating client and pushdata for mappartition (#1224) 2023-02-17 19:07:19 +08:00
Angerszhuuuu
57f775a7e9
[CELEBORN-273] Move push data timeout checker into TransportResponseHandler to keep callback status consistence (#1208) 2023-02-16 18:27:37 +08:00
Angerszhuuuu
4b6f7e4593
[CELEBORN-239][IMPROVEMENT] Worker replicate should enable push data timeout too (#1185) 2023-02-03 11:53:15 +08:00
Shuang
7162be2fae
[CELEBORN-201] Separate partitionLocationInfo in LifecycleManager and worker (#1149) 2023-01-31 18:53:36 +08:00
zy.jordan
c5be79ee3d
[CELEBORN-55][FEATURE] Split maxReqsInFlight limitation into level of target worker (#1102) 2023-01-20 10:18:45 +08:00
Shuang
2ec06472fe
[CELEBORN-203] fix NPE when removeExpiredShuffle in LifecycleManager. (#1151) 2023-01-06 18:32:17 +08:00
Shuang
3b2be25a50
[CELEBORN-173] refactor minicluster and fix ut (#1147) 2023-01-05 20:39:19 +08:00
Cheng Pan
b8758a7cb6
[CELEBORN-181][TEST] Rename RssFunSuite to CelebornFunSuite (#1125) 2022-12-29 18:10:14 +08:00
Binjie Yang
63943cd5cc
[CELEBORN-147][IT]Extraction of common integration test cases (#1092) 2022-12-29 12:03:09 +08:00
Cheng Pan
ec371c0026
[CELEBORN-132] ShuffleClient should not implement Cloneable (#1077) 2022-12-14 10:04:39 +08:00
zhongqiangczq
60f6f87832
[CELEBORN-11] ShuffleClient supports MapPartition shuffle write:pushdata (#1036) 2022-12-08 12:31:47 +08:00