celeborn

Author	SHA1	Message	Date
Fu Chen	39ab731b85	[CELEBORN-875][FOLLOWUP] Enhance `DataPushQueueSuiteJ` for thread safety and prevent `NullPointerException` ### What changes were proposed in this pull request? 1. Replaced `HashMap` with `ConcurrentHashMap` for `partitionBatchIdMap` to ensure thread safety during parallel data processing. 2. Put the partition id and batch id into `partitionBatchIdMap` before adding the task, to prevent a possible NPE. ### Why are the changes needed? To fix the NPE seen in https://github.com/apache/incubator-celeborn/actions/runs/5734532048/job/15540863715?pr=1785 ``` Exception in thread "DataPusher-0" java.lang.NullPointerException at org.apache.celeborn.client.write.DataPushQueueSuiteJ$1.pushData(DataPushQueueSuiteJ.java:121) at org.apache.celeborn.client.write.DataPusher$1.run(DataPusher.java:125) Error: The operation was canceled. ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GA Closes #1789 from cfmcgrady/celeborn-875-followup. Authored-by: Fu Chen <cfmcgrady@gmail.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-08-02 21:52:53 +08:00
Fu Chen	f869ab25b6	[CELEBORN-857][TEST] Refine DataPushQueueSuiteJ ### What changes were proposed in this pull request? 1. This PR proposes renaming the class `DataPushQueueSuitJ` to `DataPushQueueSuiteJ` so that it is picked up by the test suite. This change is required to comply with our maven-surefire-plugin rule `5f0295e9f3/pom.xml (L543-L551)` 2. Fix a potential logic bug in the test: tasks within `DataPushQueue` may inadvertently be consumed by the `DataPusher`'s built-in thread `DataPusher-${taskId}`, leading to test suite failures. ![Screenshot 2023-07-31 12:08:06 PM](https://github.com/apache/incubator-celeborn/assets/8537877/b7a294a5-a12b-474a-b43d-233998bc7f31) ![Screenshot 2023-07-31 12:07:49 PM](https://github.com/apache/incubator-celeborn/assets/8537877/c585ed00-0111-4aab-863a-e7984ed8a298) ### Why are the changes needed? Fix a DataPushQueueSuiteJ bug ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GA Closes #1774 from cfmcgrady/refine-data-push-queue-suite. Authored-by: Fu Chen <cfmcgrady@gmail.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-31 15:43:43 +08:00
Angerszhuuuu	be05ae37fe	[CELEBORN-815] Remove unused ShuffleClient.readPartition ### What changes were proposed in this pull request? As title. ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Passes GA. Closes #1739 from AngersZhuuuu/CELEBORN-815. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-20 20:49:29 +08:00
Angerszhuuuu	5471a6afe5	[CELEBORN-804] ShuffleClient should cleanup shuffle infos when trigger unregisterShuffle ### What changes were proposed in this pull request? After discussion, we confirmed that `shuffleManager.unregisterShuffle()` is triggered by Spark on both the driver and the executor. In this PR: 1. Add a shuffle client on both the driver and executor side in ShuffleManager. 2. ShuffleClient calls cleanupShuffle() when `unregisterShuffle` is triggered. This replaces https://github.com/apache/incubator-celeborn/pull/1719 ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1726 from AngersZhuuuu/CELEBORN-804. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-07-19 20:50:18 +08:00
Cheng Pan	0db919403e	Revert "[CELEBORN-798] Add heartbeat from client to LifecycleManager to clean…" This reverts commit `e56a8a8bed`.	2023-07-19 15:08:45 +08:00
zky.zhoukeyong	e56a8a8bed	[CELEBORN-798] Add heartbeat from client to LifecycleManager to cleanup client ### What changes were proposed in this pull request? Add a heartbeat from the client to the lifecycle manager. In this PR the heartbeat request contains the client's local shuffle ids; the lifecycle manager checks them against its local set and returns the ids it doesn't know. Upon receiving the response, the client calls `unregisterShuffle` for cleanup. ### Why are the changes needed? Before this PR, client side `unregisterShuffle` is never called. When running TPCDS 3T with spark thriftserver without DRA, I found the Executor's heap contains 1.6 million PartitionLocation objects (and StorageInfo): ![image](https://github.com/apache/incubator-celeborn/assets/948245/43658369-7763-4511-a5b0-9b3fbdf02005) After this PR, the number of PartitionLocation objects decreases to 275 thousand ![image](https://github.com/apache/incubator-celeborn/assets/948245/45f8f849-186d-4cad-83c8-64bd6d18debc) This heartbeat can be extended in the future for other purposes, e.g. reporting the client's metrics. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Passes GA and manual test. Closes #1719 from waitinfuture/798. Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-17 18:14:10 +08:00
zky.zhoukeyong	10a1def512	[CELEBORN-802] Reuse DataPusher#idleQueue by pooling to avoid too many byte[] objects ### What changes were proposed in this pull request? Reuse `DataPusher#idleQueue` by pooling in `SendBufferPool` to avoid too many `byte[]` objects in `PushTask`. ### Why are the changes needed? I'm testing 3T TPCDS. Before this PR, containers were killed because of OOM, and GC time is about 9.6h. For alive Executors, I dumped the memory and saw that the number of PushTask objects is about 20,000, and the number of `64k` byte[] objects is 23356, total around 1.7G: ![image](https://github.com/apache/incubator-celeborn/assets/948245/7b4ee4fa-7860-4ddb-b862-181a91748092) After this PR, no container is killed because of OOM, and GC time is about 8.6h. I also dumped an Executor and found that the number of PushTask objects is 3584, and the number of `64K` byte[] objects is 5783, total around 361M: ![image](https://github.com/apache/incubator-celeborn/assets/948245/981e8f70-52f8-4bb1-9f67-9a8b4f398392) Also, before this PR, total execution time is `3313.8s`; after this PR, total execution time is `3229.5s`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Passes GA and Manual test. Closes #1722 from waitinfuture/802. Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-17 16:35:14 +08:00
zky.zhoukeyong	4b3a47c9db	[CELEBORN-799] Limit total inflight push requests ### What changes were proposed in this pull request? As title. ### Why are the changes needed? When the number of worker instances is very large, say 1000, then before this PR the total memory consumed by inflight requests is 64K * 1000 * `celeborn.client.push.maxReqsInFlight` (16) ≈ 1G. This PR limits total inflight push requests, as 0.2.1-incubating does. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Passes GA and manual test. Closes #1720 from waitinfuture/799. Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Co-authored-by: Cheng Pan <pan3793@gmail.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-17 16:17:24 +08:00
Angerszhuuuu	693172d0bd	[CELEBORN-751] Rename remain rss related class name and filenames etc ### What changes were proposed in this pull request? Rename the remaining rss-related class names, filenames, etc. ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1664 from AngersZhuuuu/CELEBORN-751. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>	2023-07-04 10:20:08 +08:00
xiyu.zk	381165d4e7	[CELEBORN-755] Support disable shuffle compression ### What changes were proposed in this pull request? Support deciding whether to compress shuffle data through configuration. ### Why are the changes needed? Currently, Celeborn compresses all shuffle data, but some shuffle data, for example Gluten's, has already been compressed. In this case, no additional compression is required. Therefore, a configuration needs to be provided so that users can decide whether to use Celeborn’s compression according to their actual situation. ### Does this PR introduce _any_ user-facing change? No. Closes #1669 from kerwin-zk/celeborn-755. Authored-by: xiyu.zk <xiyu.zk@alibaba-inc.com> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-07-01 00:03:50 +08:00
Fu Chen	adbd38a926	[CELEBORN-726][FOLLOWUP] Update data replication terminology from `master/slave` to `primary/replica` in the codebase ### What changes were proposed in this pull request? As title ### Why are the changes needed? In order to distinguish it from the existing master/worker, refactor data replication terminology to 'primary/replica' for improved clarity and inclusivity in the codebase ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? existing tests Closes #1639 from cfmcgrady/primary-replica. Lead-authored-by: Fu Chen <cfmcgrady@gmail.com> Co-authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-06-29 17:07:26 +08:00
Angerszhuuuu	3985a5cbd7	[CELEBORN-666][FOLLOWUP] Unify all blacklist related code and comment ### What changes were proposed in this pull request? Unify all blacklist related code and comment ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1638 from AngersZhuuuu/CELEBORN-666-FOLLOWUP. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-06-28 16:28:03 +08:00
zky.zhoukeyong	6b82ecdfa0	[CELEBORN-712] Make appUniqueId a member of ShuffleClientImpl and refactor code ### What changes were proposed in this pull request? Make appUniqueId a member of ShuffleClientImpl and remove applicationId from RPC messages on the client side only, so it won't cause compatibility issues. ### Why are the changes needed? Currently Celeborn Client is bound to a single application id, so there's no need to pass applicationId around in many RPC messages on the client side. ### Does this PR introduce _any_ user-facing change? In some logs the application id will not be printed, which should not be a problem. ### How was this patch tested? UTs. Closes #1621 from waitinfuture/appid. Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-06-25 21:37:16 +08:00
Angerszhuuuu	c1c46398d5	[CELEBORN-682] Master and client should handle blacklist worker and shutting down worker separately ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1594 from AngersZhuuuu/CELEBORN-682. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-06-16 18:29:03 +08:00
Cheng Pan	76533d7324	[CELEBORN-650][TEST] Upgrade scalatest and unify mockito version ### What changes were proposed in this pull request? This PR upgrades - `mockito` from 1.10.19 and 3.6.0 to 4.11.0 - `scalatest` from 3.2.3 to 3.2.16 - `mockito-scalatest` from 1.16.37 to 1.17.14 ### Why are the changes needed? Housekeeping, making test dependencies up-to-date and unified. ### Does this PR introduce _any_ user-facing change? No, it only affects test. ### How was this patch tested? Pass GA. Closes #1562 from pan3793/CELEBORN-650. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-06-09 10:04:14 +08:00
Angerszhuuuu	cf308aa057	[CLEBORN-595] Refine code frame of CelebornConf (#1525)	2023-06-01 10:37:58 +08:00
Angerszhuuuu	62681ba85d	[CELEBORN-595] Rename and refactor the configuration doc. (#1501)	2023-05-30 15:14:12 +08:00
Angerszhuuuu	a22c61e479	[CELEBORN-582] Celeborn should handle InterruptedException during kill task properly (#1486)	2023-05-17 18:17:41 +08:00
Shuang	343f1e62d2	[CELEBORN-537][FOLLOWUP] Fix blacklist potentially lost failure workers (#1449)	2023-04-23 10:16:21 +08:00
Shuang	d68deecaaa	[CELEBORN-546][FLINK] Use autoIncrement partitionId replace encode(mapId, attemptId) for generating partitionId (#1447)	2023-04-22 16:33:22 +08:00
Shuang	62d60de8c5	[CELEBORN-537] Improve blacklist compute & minor fix for Flink (#1441)	2023-04-20 18:30:10 +08:00
Ethan Feng	6378a386d0	[CELEBORN-530][REFACTOR] Move stream manager and memory manager to worker module. (#1439)	2023-04-20 10:17:26 +08:00
cxzl25	13f772e0c0	[CELEBORN-525] Fix wrong parameter celeborn.push.buffer.size	2023-04-14 20:45:25 +08:00
Keyong Zhou	cb19ed1c66	[CELEBORN-479][PERF] Refactor DataPushQueue.takePushTask to avoid busy wait (#1386)	2023-03-27 16:18:55 +08:00
Fei Wang	7c444cb0c5	[CELEBORN-474] Speed up ConcurrentHashMap#computeIfAbsent (#1383)	2023-03-26 09:41:59 +08:00
Shuang	89b3f3887d	[CELEBORN-356] [FLINK] Support release single partition resource (#1314)	2023-03-24 17:15:28 +08:00
Keyong Zhou	107868d4f1	[CELEBORN-441][FLINK] Move ShuffleTaskInfo to Flink Plugin (#1361)	2023-03-20 13:31:53 +08:00
zhongqiangchen	9dc1bc2b1c	[CELEBORN-367] [FLINK] Move pushdata functions used by mappartition from ShuffleClientImpl to FlinkShuffleClientImpl (#1295)	2023-03-02 18:50:38 +08:00
Keyong Zhou	7adf1fca41	[CELEBORN-295] Optimize data push (#1232) * [CELEBORN-295] Add double buffer for sort pusher	2023-02-28 10:35:55 +08:00
Shuang	61065230bd	[CELEBORN-311] not retry when register for map partition occurs exception (#1246)	2023-02-21 10:16:10 +08:00
zhongqiangchen	b5dc106af8	[CELEBORN-291] optimize shuffleclientimpl creating client and pushdata for mappartition (#1224)	2023-02-17 19:07:19 +08:00
Angerszhuuuu	57f775a7e9	[CELEBORN-273] Move push data timeout checker into TransportResponseHandler to keep callback status consistence (#1208)	2023-02-16 18:27:37 +08:00
Angerszhuuuu	4b6f7e4593	[CELEBORN-239][IMPROVEMENT] Worker replicate should enable push data timeout too (#1185)	2023-02-03 11:53:15 +08:00
Shuang	7162be2fae	[CELEBORN-201] Separate partitionLocationInfo in LifecycleManager and worker (#1149)	2023-01-31 18:53:36 +08:00
zy.jordan	c5be79ee3d	[CELEBORN-55][FEATURE] Split maxReqsInFlight limitation into level of target worker (#1102)	2023-01-20 10:18:45 +08:00
Shuang	2ec06472fe	[CELEBORN-203] fix NPE when removeExpiredShuffle in LifecycleManager. (#1151)	2023-01-06 18:32:17 +08:00
Shuang	3b2be25a50	[CELEBORN-173] refactor minicluster and fix ut (#1147)	2023-01-05 20:39:19 +08:00
Cheng Pan	b8758a7cb6	[CELEBORN-181][TEST] Rename RssFunSuite to CelebornFunSuite (#1125)	2022-12-29 18:10:14 +08:00
Binjie Yang	63943cd5cc	[CELEBORN-147][IT]Extraction of common integration test cases (#1092)	2022-12-29 12:03:09 +08:00
Cheng Pan	ec371c0026	[CELEBORN-132] ShuffleClient should not implement Cloneable (#1077)	2022-12-14 10:04:39 +08:00
zhongqiangczq	60f6f87832	[CELEBORN-11] ShuffleClient supports MapPartition shuffle write:pushdata (#1036)	2022-12-08 12:31:47 +08:00
zhongqiangczq	d3d40f730c	[CELEBORN-106] flink-plugin supports shufflewrite:OutputGate (#1051)	2022-12-08 11:24:37 +08:00
Shuang	e2196e9383	[CELEBORN-56] [ISSUE-945] handle map partition mapper end (#1003)	2022-12-07 21:09:02 +08:00
Shuang	f3f104870c	[CELEBORN-75] Initialize flink plugin module (#1027)	2022-12-07 15:53:00 +08:00
nafiy	13e1e24035	[CELEBORN-86][REFATCOR] Register shuffle should have separated timeout configuration (#1031)	2022-12-01 18:39:56 +08:00
zhongqiangczq	898d1126a6	[CELEBORN-11] ShuffleClient supports MapPartition shuffle write: send handshake/regionstart/regionfinish (#1035)	2022-12-01 11:20:55 +08:00
Keyong Zhou	9214b82181	[CELEBORN-68] Client might fetch incorrect data chunk (#1010)	2022-11-26 18:06:06 +08:00
Ethan Feng	93dbf3f8b1	[CELEBORN-67] Revert "Fix fetch incorrect data chunk" related commits (#1006) * Revert "[CELEBORN-50][FOLLOWUP] Channel inactive may cause new client use old stream id to fetch data (#999)" This reverts commit `1e8f6dc5e8`. * Revert "[CELEBORN-50] Channel inActive may cause new client use old stream id to fetch data cause IllegalStateException. (#1000)" This reverts commit `f1c4d675d6`. * Revert "[CELEBORN-49] Deadlock when kill worker in shuffle read (#998)" This reverts commit `0be4b3399c`. * Revert "[CELEBORN-47][IMPROVEMENT] Refine logs about tracking fetch chunk (#995)" This reverts commit `2b05228871`. * Revert "[BUG] Fix fetch incorrect data chunk (#926)" This reverts commit `6f043f8a` * Revert "[ISSUE-925][FOLLOWUP] Refactor class name of RetryingChunkReceiveCallback (#954)" This reverts commit `64e8ebf1`	2022-11-25 20:57:47 +08:00
leesf	0b8376e2c7	Cleanup some code (#943)	2022-11-11 13:58:39 +08:00
Ethan Feng	6f043f8ae9	[BUG] Fix fetch incorrect data chunk (#926)	2022-11-09 22:31:39 +08:00

74 Commits