celeborn

Author	SHA1	Message	Date
zky.zhoukeyong	9ec223edd7	[CELEBORN-803] Increase default timeout for commit files ### What changes were proposed in this pull request? As title. ### Why are the changes needed? In 0.2.1-incubating, the default timeout for commit files is ```NETWORK_TIMEOUT```, which is 240s. That is more reasonable because committing files takes a relatively long time. In my testing with tough disks, a 30s timeout with 2 retries is not enough. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Passes GA and manual test. Closes #1724 from waitinfuture/803. Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-17 22:31:36 +08:00
caojiaqing	d64e0091f1	[CELEBORN-785] Add worker side partition hard split threshold ### What changes were proposed in this pull request? Add a configuration `celeborn.worker.shuffle.partitionSplit.max` to ensure that, in soft mode, individual partition files are limited to a size smaller than the configured value. ### Why are the changes needed? In soft mode, there may be situations where individual partition files are exceptionally large, which can result in excessively long sort times in skewed scenarios. ### Does this PR introduce _any_ user-facing change? The default value of `celeborn.worker.shuffle.partitionSplit.max` is 2g. ### How was this patch tested? none Closes #1701 from JQ-Cao/785. Authored-by: caojiaqing <caojiaqing@bilibili.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-11 14:14:41 +08:00
zky.zhoukeyong	7a47fae230	[CELEBORN-786] Change default flush threads ### What changes were proposed in this pull request? This PR changes default values of the following configs: \|config\|previous default value\|new default value\| \|----\|----\|----\| \|celeborn.worker.flusher.threads\|2\|16\| \|celeborn.worker.flusher.ssd.threads\|8\|16\| ### Why are the changes needed? If disk type is not specified, ```celeborn.worker.flusher.threads``` will be used. Recently many users use SSD for Celeborn workers without specifying disk type, and 2 flush threads is far from leveraging the power of SSD. ### Does this PR introduce _any_ user-facing change? Yes, default configs are changed. ### How was this patch tested? Passes GA. Closes #1703 from waitinfuture/786. Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-11 13:09:29 +08:00
mingji	d0ecf83fec	[CELEBORN-764] Fix celeborn on HDFS might clean using app directories ### What changes were proposed in this pull request? Make the Celeborn leader clean expired app dirs on HDFS when an application is Lost. ### Why are the changes needed? If Celeborn is working on HDFS, the storage manager cleans expired app directories when it starts, and a newly created worker will try to delete any app directories it does not know about. This can cause app directories that are still in use to be deleted unexpectedly. ### Does this PR introduce _any_ user-facing change? NO. ### How was this patch tested? UT and cluster. Closes #1678 from FMX/CELEBORN-764. Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com> Co-authored-by: Cheng Pan <pan3793@gmail.com> Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-05 23:11:50 +08:00
Angerszhuuuu	693172d0bd	[CELEBORN-751] Rename remain rss related class name and filenames etc ### What changes were proposed in this pull request? Rename the remaining RSS-related class names, file names, etc. ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1664 from AngersZhuuuu/CELEBORN-751. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>	2023-07-04 10:20:08 +08:00
mingji	40760ede3a	[CELEBORN-568] Support storage type selection ### What changes were proposed in this pull request? 1. Celeborn supports storage type selection. HDD, SSD, and HDFS are available for now. 2. Add new buffer size for HDFS file writers. 3. Worker support empty working dirs. ### Why are the changes needed? Support HDFS only scenario. ### Does this PR introduce _any_ user-facing change? NO. ### How was this patch tested? UT and cluster. Closes #1619 from FMX/CELEBORN-568. Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com> Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-06-27 18:07:08 +08:00
Shuang	da85347330	[CELEBORN-675] Fix decode heartbeat message ### What changes were proposed in this pull request? Give the Heartbeat message a one-byte body and skip this byte when decoding. ### Why are the changes needed? A Heartbeat message may be split across two Netty buffers; the `empty buffer` (which is not actually needed but must be kept) is then wrongly removed, and decodeNext throws an NPE. See ```java while (headerBuf.readableBytes() < HEADER_SIZE) { ByteBuf next = buffers.getFirst(); int toRead = Math.min(next.readableBytes(), HEADER_SIZE - headerBuf.readableBytes()); headerBuf.writeBytes(next, toRead); if (!next.isReadable()) { buffers.removeFirst().release(); } } ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT & MANUAL Closes #1589 from RexXiong/CELEBORN-675. Authored-by: Shuang <lvshuang.tb@gmail.com> Signed-off-by: zhongqiang.czq <zhongqiang.czq@alibaba-inc.com>	2023-06-14 14:37:13 +08:00
Angerszhuuuu	e18a5ea769	[CELEBORN-624] StorageManager should only remove expired app dirs (#1531 )	2023-06-02 11:33:33 +08:00
Angerszhuuuu	cf308aa057	[CELEBORN-595] Refine code frame of CelebornConf (#1525)	2023-06-01 10:37:58 +08:00
Angerszhuuuu	6d5dd50915	[CELEBORN-595][FOLLOWUP] Fix change version to 0.3.0. (#1522 )	2023-05-30 20:12:56 +08:00
Angerszhuuuu	62681ba85d	[CELEBORN-595] Rename and refactor the configuration doc. (#1501 )	2023-05-30 15:14:12 +08:00
zhongqiangchen	f117cff776	[CELEBORN-618] [FLINK] worker side adds partition split configuration options (#1520 )	2023-05-30 14:13:31 +08:00
Angerszhuuuu	d244f44518	[CELEBORN-593] Refine some RPC related default configurations (#1498 )	2023-05-19 18:23:12 +08:00
Ethan Feng	7015d2463a	[CELEBORN-583] Merge pooled memory allocators. (#1490 )	2023-05-18 10:37:30 +08:00
Angerszhuuuu	791d72d45f	[CELEBORN-590] Remove hadoop prefix of WORKER_WORKING_DIR (#1494 )	2023-05-17 17:57:27 +08:00
Angerszhuuuu	7c6cb2f3bb	[CELEBORN-588] Remove test conf's category (#1491 )	2023-05-17 17:37:28 +08:00
zhongqiangchen	5769c3fdc7	[CELEBORN-552] Add HeartBeat between the client and worker to keep alive (#1457 )	2023-05-10 19:35:51 +08:00
Angerszhuuuu	181c1bfcd6	[CELEBORN-524][PERF] CongestionControl call too much ChannelsLimiter onTrim cause CPU stuck or occupy too much CPU cause no cpu for handlePushData (#1428 )	2023-04-21 15:44:56 +08:00
Ethan Feng	9cccfc9872	[CELEBORN-431][FLINK] Support dynamic buffer allocation in reading map partition. (#1407 )	2023-04-13 10:37:47 +08:00
Angerszhuuuu	cad2836e85	[CELEBORN-505] Fix typo of SHUFFLE_CHUCK_SIZE (#1411 )	2023-04-04 19:15:30 +08:00
zhongqiangchen	cd92c423cd	[CELEBORN-475] Support extra tags for prometheus metrics (#1385)	2023-03-28 21:22:28 +08:00
Ethan Feng	0ebad677d7	[CELEBORN-434] Add constrain about memory manager's parameters. (#1356 )	2023-03-17 15:14:03 +08:00
Angerszhuuuu	4b334df7a6	[CELEBORN-399] Make fileSorterExecutors thread num can be customized (#1325 )	2023-03-10 21:10:43 +08:00
Keyong Zhou	dcedf7b0a9	[CELEBORN-348] Support fetchTime in load-aware slots assignment strategy (#1287 )	2023-03-02 18:31:50 +08:00
Keyong Zhou	7adf1fca41	[CELEBORN-295] Optimize data push (#1232 ) * [CELEBORN-295] Add double buffer for sort pusher	2023-02-28 10:35:55 +08:00
Ethan Feng	0c8bb83114	[CELEBORN-234] Implement buffer stream. (#1221 )	2023-02-17 17:38:36 +08:00
Rex(Hui) An	bb113ec9be	[CELEBORN-207] Support network congestion control (#1066 )	2023-02-07 12:06:18 +08:00
Angerszhuuuu	4b6f7e4593	[CELEBORN-239][IMPROVEMENT] Worker replicate should enable push data timeout too (#1185 )	2023-02-03 11:53:15 +08:00
Ethan Feng	a239f9f284	[CELEBORN-228]Refactor PartitionFileSorter to avoid specific JDK dependency. (#1168 )	2023-01-16 20:06:47 +08:00
zy.jordan	bb96700415	[CELEBORN-223] The default rpc thread num of pushServer/replicateServer/fetchServer should be the number of total of Flusher's thread (#1163 )	2023-01-16 12:03:46 +08:00
Keyong Zhou	fa7ba43136	[CELEBORN-225] Add global default configuration for number of flusher… (#1165 )	2023-01-14 13:20:44 +08:00
zy.jordan	19197b9190	[CELEBORN-214] Push/Replicate/Fetch io threads default value is 16 (#1158 )	2023-01-10 17:46:56 +08:00
Ethan Feng	5aa959a335	[CELEBORN-157] Change prefix of configurations to celeborn. (#1104 )	2022-12-21 15:17:28 +08:00
Keyong Zhou	2f0682265e	[CELEBORN-119] Add timeout for pushdata (#1097 )	2022-12-20 20:40:42 +08:00
nafiy	c931663e5f	[CELEBORN-110][REFACTOR] Notify critical error after collecting a certain number of non-critical error (#1055 )	2022-12-16 15:47:36 +08:00
Ethan Feng	acfaf59ab3	[CELEBORN-91] Refactor memory tracker to support read buffer. (#1038)	2022-12-05 15:38:43 +08:00
Keyong Zhou	f8bb2cd47d	[CELEBORN-12]Retry on CommitFile request (#1011 )	2022-11-26 20:56:24 +08:00
Cheng Pan	d7be6006e7	Migrate network related conf to structured conf system (#875 ) * Migrate network related conf to structured conf system * migrate * fix * fix * worker * fix * nit * review * nit	2022-10-28 10:45:52 +08:00
Angerszhuuuu	d283cca4e1	[ISSUE-869][REFACTOR] Migrate partition size/sorter related conf to Celeborn ConfigEntity (#870 )	2022-10-27 16:49:55 +08:00
Angerszhuuuu	26dcc118c6	[ISSUE-871][REFACTOR] Migrate Worker conf to Celeborn Configuration System (#873)	2022-10-27 15:35:29 +08:00
Angerszhuuuu	399236c880	[ISSUE-849][REFACTOR] Migrate master and common Celeborn Configuration System (#850 )	2022-10-26 17:09:27 +08:00
nafiy	e44e8c9610	[ISSUE-828][REFACTOR] Migrate memory tracker related configs to ConfigEntry (#831 ) * [ISSUE-828][REFACTOR] Migrate memory tracker related configs to ConfigEntry * Fix based on review * update doc * resolve review feedback * fix * Fix based on review * fix based on review	2022-10-25 21:16:53 +08:00
Cheng Pan	e71c0228aa	Migrate columnar shuffle configurations to ConfigEntry (#844 )	2022-10-25 14:26:11 +08:00
Cheng Pan	8d7d397e71	Fix Configuration page and polish naming (#838 ) * Fix Configuration page and polish naming * nit * nit * comment	2022-10-24 12:46:25 +08:00
Ethan Feng	392a252baa	[FOLLOWUP][ISSUE-813]Update doc and fix typo. (#825 )	2022-10-22 23:02:22 +08:00
nafiy	1a8a36e8fe	[ISSUE-812][Refactor] Migrate metrics system related configs to ConfigEntry (#821 )	2022-10-21 13:57:58 +08:00
Ethan Feng	5c761a8df3	[ISSUE-813][Refactor] Refactor flusher configurations. (#813 ) * Refactor flusher configurations. * Refactor flusher configurations. * Update. * remove brackets. * update docs. * rename. * update. * update docs. * update. * update. * update. * update. * update. * update. * update. * format. * update. * update.	2022-10-20 15:23:17 +08:00
AngersZhuuuu	23c65a27a9	[ISSUE-798][REFACTOR] Migrate worker-recover related conf to ConfigEntry (#799 )	2022-10-19 16:42:00 +08:00
Cheng Pan	cb07cf62c0	Auto generate configuration docs (#794 )	2022-10-19 10:50:22 +08:00

49 Commits
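
For reference, the worker defaults quoted in the CELEBORN-785 and CELEBORN-786 entries could be pinned explicitly in a worker's configuration. This is a minimal sketch, not taken from the repository: it assumes a whitespace-separated `key value` file such as `conf/celeborn-defaults.conf` (the file name and format are assumptions), and it uses only the keys and values that appear in those commit messages.

```properties
# Assumed location: conf/celeborn-defaults.conf (file name and format are assumptions)

# Flush threads used when no disk type is specified (CELEBORN-786: default 2 -> 16)
celeborn.worker.flusher.threads 16

# Flush threads for SSD disks (CELEBORN-786: default 8 -> 16)
celeborn.worker.flusher.ssd.threads 16

# Hard upper bound on a single partition file in soft split mode (CELEBORN-785: default 2g)
celeborn.worker.shuffle.partitionSplit.max 2g
```

Setting these explicitly only restates the defaults described in the log; a deployment on HDDs may want a lower flusher thread count, which the commit messages do not cover.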