celeborn

Author	SHA1	Message	Date
Angerszhuuuu	6623309237	[CELEBORN-1143][BUG] SortBasedPusher pushData should inc memory spill metrics ### What changes were proposed in this pull request? SortBasedPusher `pushData` should inc memory spill metrics ### Why are the changes needed? Make metrics more accurate ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Closes #2117 from AngersZhuuuu/CELEBORN-1143. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-11-28 16:26:00 +08:00
Erik.fang	aee41555c6	[CELEBORN-955] Re-run Spark Stage for Celeborn Shuffle Fetch Failure ### What changes were proposed in this pull request? Currently, Celeborn uses replication to handle shuffle data lost for celeborn shuffle reader, this PR implements an alternative solution by Spark stage resubmission. Design doc: https://docs.google.com/document/d/1dkG6fww3g99VAb1wkphNlUES_MpngVPNg8601chmVp8/edit ### Why are the changes needed? Spark stage resubmission uses less resources compared with replication, and some Celeborn users are also asking for it ### Does this PR introduce _any_ user-facing change? a new config celeborn.client.fetch.throwsFetchFailure is introduced to enable this feature ### How was this patch tested? two UTs are attached, and we also tested it in Ant Group's Dev spark cluster Closes #1924 from ErikFang/Re-run-Spark-Stage-for-Celeborn-Shuffle-Fetch-Failure. Lead-authored-by: Erik.fang <fmerik@gmail.com> Co-authored-by: Cheng Pan <pan3793@gmail.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-11-26 16:47:58 +08:00
SteNicholas	df40a28959	[CELEBORN-1032][FOLLOWUP] Use scheduleWithFixedDelay instead of scheduleAtFixedRate in threads pool of master and worker ### What changes were proposed in this pull request? Use `scheduleWithFixedDelay` instead of `scheduleAtFixedRate` in thread pool of Celeborn Master and Worker. ### Why are the changes needed? Follow up #1970. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Internal tests. Closes #2048 from SteNicholas/CELEBORN-1032. Authored-by: SteNicholas <programgeek@163.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-10-27 11:20:28 +08:00
Mridul Muralidharan	eb382b018c	[CELEBORN-1072] Fix misc error prone reports found Fix misc error prone reports. As detailed in the jira, they are: * Reference equality of boxed primitive types: see [BoxedPrimitiveEquality](https://errorprone.info/bugpattern/BoxedPrimitiveEquality) * Calling run directly - since use is legitimate, mark it as ignore. See: [DoNotCall](https://errorprone.info/bugpattern/DoNotCall) * `Ignore` test instead of catching `AssertionError` and ignoring it. See: [AssertionFailureIgnored](https://errorprone.info/bugpattern/AssertionFailureIgnored) Fix misc error prone reports. No Unit tests Closes #2019 from mridulm/fix-misc-issues. Authored-by: Mridul Muralidharan <mridulatgmail.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-10-23 11:15:10 +08:00
Fu Chen	b2412d0774	[CELEBORN-1022][TEST] Update log level from `FATAL` to `ERROR` for console output in unit tests ### What changes were proposed in this pull request? As title ### Why are the changes needed? 1. this is developer-friendly for debugging unit tests in IntelliJ IDEA, for example: Netty's memory leak reports are logged at the error level and won't cause unit tests to be marked as fatal. ``` 23/10/09 09:57:26,422 ERROR [fetch-server-52-2] ResourceLeakDetector: LEAK: ByteBuf.release() was not called before it's garbage-collected. See https://netty.io/wiki/reference-counted-objects.html for more information. Recent access records: Created at: io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:403) io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:188) io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:179) io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:140) io.netty.channel.DefaultMaxMessagesRecvByteBufAllocator$MaxMessageHandle.allocate(DefaultMaxMessagesRecvByteBufAllocator.java:120) io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:150) io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788) io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724) io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650) io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562) io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) java.lang.Thread.run(Thread.java:750) ``` 2. this won't increase console output and affect the stability of CI. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GA Closes #1958 from cfmcgrady/ut-console-log-level. Authored-by: Fu Chen <cfmcgrady@gmail.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-10-09 15:56:05 +08:00
zky.zhoukeyong	6a5e3ed794	[CELEBORN-812] Cleanup SendBufferPool if idle for long ### What changes were proposed in this pull request? Cleans up the pooled send buffers and push tasks if the SendBufferPool has been idle for more than `celeborn.client.push.sendbufferpool.expireTimeout`. ### Why are the changes needed? Before this PR the SendBufferPool will cache the send buffers and push tasks forever. If they are large and will not be reused in the future, it wastes memory and causes GC. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Passes GA and manual tests. Closes #1735 from waitinfuture/812-1. Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-20 00:34:55 +08:00
zky.zhoukeyong	10a1def512	[CELEBORN-802] Reuse DataPusher#idleQueue by pooling to avoid too many byte[] objects ### What changes were proposed in this pull request? Reuse ```DataPusher#idleQueue``` by pooling in ```SendBufferPool``` to avoid too many ```byte[]``` objects in ```PushTask```. ### Why are the changes needed? I'm testing 3T TPCDS. Before this PR, I encountered Container killed because of OOM, GC is about 9.6h. For alive Executors, I dumped the memory and see number of PushTask object is 2w, and the number of ```64k``` byte[] is 23356, total around 1.7G: ![image](https://github.com/apache/incubator-celeborn/assets/948245/7b4ee4fa-7860-4ddb-b862-181a91748092) After this PR, no container is killed because of OOM, GC is about 8.6h. I also dumped Executor and found number of PushTask object is 3584, and the number of ```64K``` byte[] objects is 5783, total around 361M: ![image](https://github.com/apache/incubator-celeborn/assets/948245/981e8f70-52f8-4bb1-9f67-9a8b4f398392) Also, before this PR, total execution time is ```3313.8s```, after this PR, total execution time is ```3229.5s```. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Passes GA and Manual test. Closes #1722 from waitinfuture/802. Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-17 16:35:14 +08:00
Fu Chen	90ba9f3e87	[CELEBORN-783][FOLLOWUP] Private member updates and cleanup in `SortBasedPusher` ### What changes were proposed in this pull request? As title ### Why are the changes needed? https://github.com/apache/incubator-celeborn/pull/1699#discussion_r1259137323 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GA Closes #1704 from cfmcgrady/insert-record-followup. Authored-by: Fu Chen <cfmcgrady@gmail.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-11 23:08:42 +08:00
Fu Chen	e47ec10cef	[CELEBORN-783] Revise the conditions for the `SortBasedPusher#insertRecord` method ### What changes were proposed in this pull request? As title ### Why are the changes needed? [comment](`7adf1fca41 (r121138008)`) ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New UT Closes #1699 from cfmcgrady/insert-record. Authored-by: Fu Chen <cfmcgrady@gmail.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-11 11:36:29 +08:00
Angerszhuuuu	693172d0bd	[CELEBORN-751] Rename remain rss related class name and filenames etc ### What changes were proposed in this pull request? Rename remain rss related class name and filenames etc... ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1664 from AngersZhuuuu/CELEBORN-751. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>	2023-07-04 10:20:08 +08:00
Angerszhuuuu	4c67325a3d	[CELEBORN-720][SPARK] Correct metric peakExecutionMemory of SortBasedShuffleWriter ### What changes were proposed in this pull request? Currently SortBasedShuffleWriter won't update peakMemoryUsedBytes, this pr support this. ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1632 from AngersZhuuuu/CELEBORN-720. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-06-27 18:40:06 +08:00
zky.zhoukeyong	6b82ecdfa0	[CELEBORN-712] Make appUniqueId a member of ShuffleClientImpl and refactor code ### What changes were proposed in this pull request? Make appUniqueId a member of ShuffleClientImpl and remove applicationId from RPC messages across client side, so it won't cause compatibility issues. ### Why are the changes needed? Currently Celeborn Client is bound to a single application id, so there's no need to pass applicationId around in many RPC messages in client side. ### Does this PR introduce _any_ user-facing change? In some logs the application id will not be printed, which should not be a problem. ### How was this patch tested? UTs. Closes #1621 from waitinfuture/appid. Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-06-25 21:37:16 +08:00
Cheng Pan	6b64b1de9c	[CELEBORN-648][SPARK] Improve perf of SendBufferPool and logs about memory ### What changes were proposed in this pull request? - Replace index-based item access with an iterator for LinkedList. - Always try to remove a buffer if SendBufferPool does not have a matched candidate, this change makes the total buffer number from `capacity+N-1` to `capacity` in worst cases. - Some logs and code polish. ### Why are the changes needed? Improve performance and logs, reduce memory consumption. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. Closes #1560 from pan3793/CELEBORN-648. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-06-09 09:45:27 +08:00
Angerszhuuuu	cf308aa057	[CELEBORN-595] Refine code frame of CelebornConf (#1525)	2023-06-01 10:37:58 +08:00
Angerszhuuuu	a22c61e479	[CELEBORN-582] Celeborn should handle InterruptedException during kill task properly (#1486)	2023-05-17 18:17:41 +08:00
Keyong Zhou	7adf1fca41	[CELEBORN-295] Optimize data push (#1232) * [CELEBORN-295] Add double buffer for sort pusher	2023-02-28 10:35:55 +08:00
Keyong Zhou	e47f1e33b0	[CELEBORN-55][FOLLOWUP] Code refine (#1175)	2023-01-20 16:22:47 +08:00
zy.jordan	c5be79ee3d	[CELEBORN-55][FEATURE] Split maxReqsInFlight limitation into level of target worker (#1102)	2023-01-20 10:18:45 +08:00
AngersZhuuuu	0fdb19065a	[ISSUE-841][REFACTOR] Migrate shuffle client side conf to Celeborn Configuration System (#842)	2022-10-24 20:48:48 +08:00
AngersZhuuuu	a773c8e6db	[ISSUE-820][Refactor] Rename RssConf to CelebornConf (#826)	2022-10-20 20:13:13 +08:00
AngersZhuuuu	8344479df1	[ISSUE-818][REFACTOR] Move existing RssConf.xxx conf method to RssConf class (#822) * [ISSUE-818][REFACTOR] Move existing RssConf.xxx conf method to RssConf class Co-authored-by: Ethan Feng <ethan.aquarius.fmx@gmail.com>	2022-10-20 18:10:59 +08:00
Cheng Pan	ea67f4e060	Introduce categories to ConfigEntry and migrate configurations (#775)	2022-10-17 16:56:54 +08:00
Cheng Pan	96e969f46e	[BUILD] Extract project.version to Maven Property (#772)	2022-10-16 19:01:40 +08:00
Cheng Pan	805b8e6aba	[BUILD] Unify name style for module celeborn-client-spark-common (#748)	2022-10-10 12:40:21 +08:00
Cheng Pan	ab16b4f101	[INFRA] Rename modules w/ celeborn prefix (#723)	2022-10-08 08:05:57 +08:00

25 Commits