celeborn

Author	SHA1	Message	Date
mingji	17cfbd7dc7	[CELEBORN-948][DOC] fix quick start doc about failed to submit flink wordcount ### What changes were proposed in this pull request? Update the script to start word count demo. ### Why are the changes needed? A user reported that he could not run the demo while following the quick start docs. ### Does this PR introduce _any_ user-facing change? NO. ### How was this patch tested? Cluster. Closes #1880 from FMX/CELEBORN-948. Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-09-05 17:44:16 +08:00
zky.zhoukeyong	a42ec85a6e	[CELEBORN-943][PERF] Pre-create CelebornInputStreams in CelebornShuffleReader ### What changes were proposed in this pull request? This PR fixes performance degradation when Spark's coalescePartitions takes effect caused by RPC latency. ### Why are the changes needed? I encountered a performance degradation when testing tpcds 10T q10: \|\|Time\| \|---\|---\| \|ESS\|14s\| \|Celeborn\| 24s\| After digging into it I found out that q10 triggers partition coalescence: ![image](https://github.com/apache/incubator-celeborn/assets/948245/0b4745da-8d57-4661-a35d-683d97f56e1d) As I configured `spark.sql.adaptive.coalescePartitions.initialPartitionNum` to 1000, `CelebornShuffleReader` will call `shuffleClient.readPartition` sequentially 1000 times, causing the delay. This PR optimizes by calling `shuffleClient.readPartition` in parallel. After this PR q10 time becomes 14s. ### Does this PR introduce _any_ user-facing change? No, but introduced a new client side configuration `celeborn.client.streamCreatorPool.threads` which defaults to 32. ### How was this patch tested? TPCDS 1T and passes GA. Closes #1876 from waitinfuture/943. Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Co-authored-by: Keyong Zhou <waitinfuture@gmail.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-09-04 21:46:11 +08:00
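The parallel pre-creation described in CELEBORN-943 can be pictured with a minimal Java sketch. This is not the code from the PR: `StreamPreCreator`, `openAll`, and the `openStream` callback are hypothetical names standing in for whatever wraps `shuffleClient.readPartition`, and creating a pool per call is a simplification (a real client would presumably share one pool, sized by `celeborn.client.streamCreatorPool.threads`).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.IntFunction;

// Hypothetical helper: opens one stream per partition in parallel instead of sequentially.
final class StreamPreCreator {
  // openStream is an assumed stand-in for a call such as shuffleClient.readPartition(...).
  static <T> List<T> openAll(
      int startPartition, int endPartition, int threads, IntFunction<T> openStream) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      // Submit every partition up front so the RPC latencies overlap.
      List<Future<T>> futures = new ArrayList<>();
      for (int p = startPartition; p < endPartition; p++) {
        final int partitionId = p;
        Callable<T> task = () -> openStream.apply(partitionId);
        futures.add(pool.submit(task));
      }
      // Collect results in partition order, so callers still read partitions sequentially.
      List<T> streams = new ArrayList<>(futures.size());
      for (Future<T> future : futures) {
        streams.add(future.get());
      }
      return streams;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while opening streams", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("Failed to open a stream", e.getCause());
    } finally {
      pool.shutdownNow();
    }
  }
}
```

With 1000 coalesced partitions and 32 threads, the wall-clock cost of stream creation in this sketch is roughly that of 1000/32 sequential calls rather than 1000.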
zhongqiang.czq	b66eaff880	[CELEBORN-627][FLINK] Support split partitions ### What changes were proposed in this pull request? In MapPartition, data is split into regions. 1. Unlike ReducePartition, whose partition split can occur while pushing data, MapPartition must keep its data ordered, so PartitionSplit can only be done when sending PushDataHandShake or RegionStart messages (as shown in the following image). That is to say, a partition split only appears at the beginning of a region, never inside a region. > Notice: the client side may think that it failed to push a HandShake or RegionStart message while the worker side still received a normal HandShake/RegionStart message. After the client revives successfully, it does not push any messages to the old partition, so the worker holding the old partition will create an empty file. After committing files, the worker will return empty commit ids. That is to say, empty files will be filtered out after committing files, and ReduceTask will not read any empty files. ![image](https://github.com/apache/incubator-celeborn/assets/96606293/468fd660-afbc-42c1-b111-6643f5c1e944) 2. PushData/RegionFinish do not handle the following cases: - Diskfull - ExceedPartitionSplitThreshold - Worker ShuttingDown, so if one of the above three conditions appears, PushData and RegionFinish can still proceed as normal. Workers should consider the ShuttingDown case and try their best to wait for all regions to finish before shutting down. If PushData or RegionFinish fails, for example because of a network timeout, the MapTask will fail and another MapTask attempt will be started. ![image](https://github.com/apache/incubator-celeborn/assets/96606293/db9f9166-2085-4be1-b09e-cf73b469c55b) 3. How does shuffle read support partition split? ReduceTask should get the split partitions in order and open the streams in partition epoch order. ### Why are the changes needed? PartitionSplit is not yet supported by MapPartition. There is still a risk that a partition file is too large to store on the worker disk. To avoid this risk, this PR introduces partition split in shuffle read and shuffle write. ### Does this PR introduce _any_ user-facing change? NO. ### How was this patch tested? UT and manual TPCDS test Closes #1550 from FMX/CELEBORN-627. Lead-authored-by: zhongqiang.czq <zhongqiang.czq@alibaba-inc.com> Co-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com> Co-authored-by: Ethan Feng <ethanfeng@apache.org> Signed-off-by: zhongqiang.czq <zhongqiang.czq@alibaba-inc.com>	2023-09-01 19:25:51 +08:00
mingji	2ee6e305f1	[CELEBORN-941] fix incorrect deploy doc ### What changes were proposed in this pull request? Fix the incorrect deploy doc about using HDFS only. ### Why are the changes needed? Ditto. ### Does this PR introduce _any_ user-facing change? NO. ### How was this patch tested? Just docs. Closes #1874 from FMX/CELEBORN-941. Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>	2023-08-31 18:54:27 +08:00
SteNicholas	baaddb8ee8	[CELEBORN-822][DOC] Introduce a quick start guide for running Apache Flink with Apache Celeborn ### What changes were proposed in this pull request? Introduce a quick start guide for running Apache Flink with Apache Celeborn to help Flink users to run with Celeborn. ### Why are the changes needed? There is no quick start guide for running Apache Flink with Apache Celeborn. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? None. Closes #1868 from SteNicholas/CELEBORN-822. Authored-by: SteNicholas <programgeek@163.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-08-30 21:38:03 +08:00
mingji	505ba804c7	[CELEBORN-752] Support read local shuffle file for spark ### What changes were proposed in this pull request? For Spark clusters, support reading local shuffle files if Celeborn is co-deployed with YARN node managers. This PR helps to reduce the number of active connections. ### Why are the changes needed? Ditto. ### Does this PR introduce _any_ user-facing change? NO. ### How was this patch tested? GA and cluster. The performance is identical whether or not you enable the local reader, but the number of active connections may vary according to your connections per peer. <img width="951" alt="Screenshot 2023-08-16 20 20 14" src="https://github.com/apache/incubator-celeborn/assets/4150993/9106e731-28fc-4e78-9c05-ae6a269d249a"> The number of active connections dropped from 3745 to 2894. This PR will help to improve cluster stability. Closes #1812 from FMX/CELEBORN-752. Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-08-30 18:52:18 +08:00
SteNicholas	92777c3ff2	[CELEBORN-927][DOC] Correct celeborn.metrics.conf.*.sink.csv.class configuration example for a CSV sink ### What changes were proposed in this pull request? Correct the `celeborn.metrics.conf.*.sink.csv.class` configuration example for a CSV sink. ### Why are the changes needed? The `celeborn.metrics.conf.*.sink.csv.class` configuration example for a CSV sink is wrong; its value should be `org.apache.celeborn.common.metrics.sink.CsvSink`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? None. Closes #1865 from SteNicholas/CELEBORN-927. Authored-by: SteNicholas <programgeek@163.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-08-30 16:11:03 +08:00
zhouyifan279	dc5bdfadcc	[CELEBORN-923][DOC] docs/developers/overview.md has a broken link ### What changes were proposed in this pull request? Fix a broken link in docs/developers/overview.md. ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Locally tested. Closes #1845 from zhouyifan279/upgrade-page-link. Authored-by: zhouyifan279 <zhouyifan279@gmail.com> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-08-28 12:07:43 +08:00
Keyong Zhou	1d04a23289	[CELEBORN-920] Worker sends its load to Master through heartbeat ### What changes were proposed in this pull request? Adding a flag indicating high load in the worker's heartbeat allows the master to better schedule the workers ### Why are the changes needed? In our production environment, there is a node with abnormally high load, but the master is not aware of this situation. It assigned numerous jobs to this node, and as a result, the stability of these jobs has been affected. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? UT Closes #1840 from JQ-Cao/920. Lead-authored-by: Keyong Zhou <zky.zhoukeyong@alibaba-inc.com> Co-authored-by: caojiaqing <caojiaqing@bilibili.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-08-26 13:58:37 +08:00
lishiyucn	57a35ca349	[CELEBORN-498] Add new config for DfsPartitionReader's chunk size ### What changes were proposed in this pull request? As title ### Why are the changes needed? Make `celeborn.shuffle.chunk.size` a worker-side-only config. Add a new client-side config `celeborn.client.fetch.dfsReadChunkSize` for DfsPartitionReader ### Does this PR introduce _any_ user-facing change? Yes, the chunk size of DfsPartitionReader is now controlled by `celeborn.client.fetch.dfsReadChunkSize` instead of the client-side config `celeborn.shuffle.chunk.size` ### How was this patch tested? Passes GA Closes #1834 from lishiyucn/main. Lead-authored-by: lishiyucn <675590586@qq.com> Co-authored-by: shiyu li <675590586@qq.com> Co-authored-by: Keyong Zhou <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-08-24 21:31:34 +08:00
zwangsheng	80948e89ae	[CELEBORN-909][DOC] Mention `celeborn.worker.directMemoryRatioToResume` default value changed in main/0.4 ### What changes were proposed in this pull request? As title ### Why are the changes needed? After #1829, the default value of `celeborn.worker.directMemoryRatioToResume` changed from `0.5` to `0.7`. ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? No Closes #1836 from zwangsheng/CELEBORN-909. Lead-authored-by: zwangsheng <2213335496@qq.com> Co-authored-by: Keyong Zhou <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-08-24 21:08:38 +08:00
zwangsheng	2ffd6d7b28	[CELEBORN-905] Redraw the flowchart backpressure.svg after worker pause logic is reconstructed ### What changes were proposed in this pull request? Add a new `backpressure.svg` to replace the outdated one. ### Why are the changes needed? #1811 refactored the Celeborn worker back-pressure logic, so a new flowchart is needed to help users understand it. ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? ![backpressure](https://github.com/apache/incubator-celeborn/assets/52876270/34f3f4b8-28cf-4cce-88a4-e6fee1886d94) Closes #1829 from zwangsheng/CELEBORN-905. Authored-by: zwangsheng <2213335496@qq.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-08-24 11:51:01 +08:00
Angerszhuuuu	17de30009b	[CELEBORN-847] Support use RESTful API to trigger worker exit and exitImmediately ### What changes were proposed in this pull request? As title ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1768 from AngersZhuuuu/CELEBORN-847. Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com> Co-authored-by: Keyong Zhou <zhouky@apache.org> Co-authored-by: Keyong Zhou <waitinfuture@gmail.com> Co-authored-by: Keyong Zhou <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-08-15 20:04:26 +08:00
e	4a4a37ed17	[MINOR] Fix typo in CelebornConf ### What changes were proposed in this pull request? Fix typo in CelebornConf ### Why are the changes needed? Fix typo in CelebornConf ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? Passing GA Closes #1813 from jiaoqingbo/typo-conf. Authored-by: e <1178404354@qq.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-08-15 10:32:08 +08:00
Fu Chen	efc334a6aa	[CELEBORN-877][FOLLOWUP][DOC] Expand 'note' blocks by default in the docs sbt.md ### What changes were proposed in this pull request? As title ### Why are the changes needed? As title ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GA Closes #1806 from cfmcgrady/sbt-docs-followup. Authored-by: Fu Chen <cfmcgrady@gmail.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-08-11 21:54:24 +08:00
Fu Chen	516bdc7e08	[CELEBORN-877][DOC] Document on SBT ### What changes were proposed in this pull request? As title ### Why are the changes needed? As title ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manual test Closes #1795 from cfmcgrady/sbt-docs. Authored-by: Fu Chen <cfmcgrady@gmail.com> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-08-11 12:17:55 +08:00
zwangsheng	63df84593e	[CELEBORN-883][WORKER] Optimized configuration checks during MemoryManager initialization ### What changes were proposed in this pull request? 1. Expose the config check logic during `MemoryManager#initialization` in the user configuration doc. 2. Add Preconditions Error Message 3. Add unit test to make sure that part of the logic isn't altered by mistake ### Why are the changes needed? User-friendly ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? Add Unit Test Closes #1801 from zwangsheng/CELEBORN-883. Authored-by: zwangsheng <2213335496@qq.com> Signed-off-by: zwangsheng <2213335496@qq.com>	2023-08-11 10:46:00 +08:00
Kerwin Zhang	4fb3f31a2d	[CELEBORN-870][FOLLOWUP][DOC] Document on usage together with Gluten (#1793)	2023-08-08 10:37:13 +08:00
zky.zhoukeyong	6ea1ee2ec4	[CELEBORN-152] Add config to limit max workers when offering slots ### What changes were proposed in this pull request? Add a config to limit the max workers when offering slots. The config can be set on both the server side and the client side; Celeborn will choose the smaller positive value of the client and master settings. ### Why are the changes needed? For large Celeborn clusters, users may want to limit the number of workers that a shuffle can spread across. The reasons are: 1. One worker failure will not affect all applications 2. One huge shuffle will not affect all applications 3. It's more efficient to limit a shuffle within a restricted number of workers, say 100, than spreading across a large number of workers, say 1000, because the number of network connections for pushing data is `number of ShuffleClient` * `number of allocated Workers` The recommended number of Workers should depend on workload and Worker hardware, and this can be configured per application, so it's relatively flexible. ### Does this PR introduce _any_ user-facing change? No, added a new configuration. ### How was this patch tested? Added ITs and passes GA. Closes #1790 from waitinfuture/152. Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-08-07 10:13:53 +08:00
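The "smaller positive" rule and the connection-count argument in CELEBORN-152 can be sketched as follows. This is only an illustration under stated assumptions: `MaxWorkerLimit` and its methods are hypothetical, and treating non-positive values as "unset" with `-1` meaning "no limit" is an assumed convention, not something the commit message specifies.

```java
// Hypothetical illustration of choosing the effective worker limit for a shuffle.
final class MaxWorkerLimit {
  // Returns the smaller positive value of the two settings.
  // Assumption: a non-positive value means "not configured"; -1 is returned if neither is set.
  static int effectiveLimit(int clientMax, int masterMax) {
    if (clientMax > 0 && masterMax > 0) {
      return Math.min(clientMax, masterMax);
    }
    if (clientMax > 0) {
      return clientMax;
    }
    if (masterMax > 0) {
      return masterMax;
    }
    return -1;
  }

  // Push connections grow as (number of ShuffleClients) * (number of allocated Workers).
  static long pushConnections(long shuffleClients, long allocatedWorkers) {
    return shuffleClients * allocatedWorkers;
  }
}
```

For example, with 2000 clients, capping a shuffle at 100 workers instead of 1000 reduces the push connections in this model from 2,000,000 to 200,000.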
mingji	efc9a875e9	[CELEBORN-863] Persist committed file infos to support worker recovery ### What changes were proposed in this pull request? Support worker recovery after a crash when the worker has graceful shutdown enabled. 1. Persist committed file infos to LevelDB. 2. Load LevelDB when the worker starts. 3. Clean expired file infos in LevelDB. ### Why are the changes needed? Ditto. ### Does this PR introduce _any_ user-facing change? NO. ### How was this patch tested? GA and cluster. After testing on a cluster I found that 8k file infos consume about 2MB of disk space; the disk space can be reclaimed shortly after the shuffle expires. Closes #1779 from FMX/CELEBORN-863. Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-08-04 23:58:47 +08:00
xiyu.zk	35fe63e4a9	[CELEBORN-870][DOC] Document on usage together with Gluten ### What changes were proposed in this pull request? As title. ### Why are the changes needed? As title. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual test. Closes #1784 from kerwin-zk/gluten_celeborn. Lead-authored-by: xiyu.zk <xiyu.zk@alibaba-inc.com> Co-authored-by: Kerwin Zhang <xiyu.zk@alibaba-inc.com> Co-authored-by: Keyong Zhou <zhouky@apache.org> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-08-04 11:32:13 +08:00
zky.zhoukeyong	3ee0674058	[CELEBORN-869][FOLLOWUP][DOC] Document on Integrating Celeborn ### What changes were proposed in this pull request? As title. ### Why are the changes needed? As title. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual test. Closes #1788 from waitinfuture/869-fu. Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-08-02 18:17:17 +08:00
Keyong Zhou	8c473c038b	[CELEBORN-869][DOC] Document on Integrating Celeborn ### What changes were proposed in this pull request? As title. ### Why are the changes needed? As title. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual test. Closes #1787 from waitinfuture/869. Lead-authored-by: Keyong Zhou <waitinfuture@gmail.com> Co-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-08-02 17:22:41 +08:00
zky.zhoukeyong	bee8648421	[CELEBORN-864][DOC] Document on blacklist ### What changes were proposed in this pull request? As title. ### Why are the changes needed? As title. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual test. Closes #1782 from waitinfuture/864. Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Co-authored-by: Keyong Zhou <waitinfuture@gmail.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-08-01 21:23:55 +08:00
zky.zhoukeyong	3593adf12d	[CELEBORN-860][DOC] Document on ShuffleClient ### What changes were proposed in this pull request? As title. ### Why are the changes needed? As title. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual test. Closes #1778 from waitinfuture/860-1. Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Co-authored-by: Keyong Zhou <waitinfuture@gmail.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-31 20:07:20 +08:00
zky.zhoukeyong	37a9c633b3	[CELEBORN-853][DOC] Document on LifecycleManager ### What changes were proposed in this pull request? As title. ### Why are the changes needed? As title. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual test. Closes #1775 from waitinfuture/853. Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-31 17:36:42 +08:00
zky.zhoukeyong	b36ea39001	[CELEBORN-834][DOC] Add fault tolerant document ### What changes were proposed in this pull request? As title. ### Why are the changes needed? As title. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual test. Closes #1769 from waitinfuture/834. Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-28 10:39:08 +08:00
zky.zhoukeyong	41509d6e7e	[CELEBORN-849][DOC] Document on Master ### What changes were proposed in this pull request? As title. ### Why are the changes needed? As title. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual test. Closes #1772 from waitinfuture/849. Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Co-authored-by: Keyong Zhou <waitinfuture@gmail.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-27 22:09:43 +08:00
Angerszhuuuu	5cb73ed3b4	[CELEBORN-851] Mention Celeborn 0.4 server requires 0.3 or above clients ### What changes were proposed in this pull request? As title ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1770 from AngersZhuuuu/CELEBORN-851. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>	2023-07-27 18:07:44 +08:00
Angerszhuuuu	0db2150731	[CELEBORN-808] Remove unnecessary RssShuffleManager in 0.4.0 ### What changes were proposed in this pull request? As title ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1731 from AngersZhuuuu/CELEBORN-808. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>	2023-07-27 17:47:44 +08:00
Angerszhuuuu	bacfb54447	[CELEBORN-832] Support use RESTful API to trigger worker decommission ### What changes were proposed in this pull request? As title ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1759 from AngersZhuuuu/CELEBORN-832. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>	2023-07-27 15:40:14 +08:00
Fu Chen	c5ddf9b2ca	[CELEBORN-822][FOLLOWUP] Format the example code in the docs/README.md ### What changes were proposed in this pull request? As title ### Why are the changes needed? To make it clearer and more readable ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass CI Closes #1763 from cfmcgrady/celeborn-822-followup. Authored-by: Fu Chen <cfmcgrady@gmail.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-26 20:02:13 +08:00
zky.zhoukeyong	b8cdf36b40	[CELEBORN-831][DOC] Add traffic control document ### What changes were proposed in this pull request? As title. ### Why are the changes needed? As title. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual test. Closes #1754 from waitinfuture/831. Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-24 19:51:02 +08:00
zky.zhoukeyong	070d8bc0f8	[CELEBORN-826][DOC] Add storage document ### What changes were proposed in this pull request? As title. ### Why are the changes needed? As title. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? No. Closes #1752 from waitinfuture/826. Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-24 16:12:42 +08:00
Angerszhuuuu	00c36fda99	[CELEBORN-828] Merge Monitoring to Development doc ### What changes were proposed in this pull request? As title <img width="1610" alt="Screenshot 2023-07-24 11 34 43 AM" src="https://github.com/apache/incubator-celeborn/assets/46485123/ba1b040b-9ea4-4c93-b055-75a469365ff2"> ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1751 from AngersZhuuuu/CELEBORN-828. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-24 15:37:32 +08:00
Cheng Pan	fa79b263a0	[CELEBORN-827] Eliminate unnecessary chunksBeingTransferred calculation ### What changes were proposed in this pull request? Eliminate `chunksBeingTransferred` calculation when `celeborn.shuffle.io.maxChunksBeingTransferred` is not configured ### Why are the changes needed? I observed high CPU usage on `ChunkStreamManager#chunksBeingTransferred` calculation. We can eliminate the method call if no threshold is configured, and investigate how to improve the method itself in the future. <img width="1947" alt="image" src="https://github.com/apache/incubator-celeborn/assets/26535726/412c6a41-c0ce-440c-ae99-4424cb8702d3"> ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI and Review. Closes #1749 from pan3793/CELEBORN-827. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-07-24 15:31:57 +08:00
zky.zhoukeyong	8e849645eb	[CELEBORN-824][DOC] Add PushData document ### What changes were proposed in this pull request? As title. ### Why are the changes needed? As title ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? No. Closes #1747 from waitinfuture/824. Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-24 10:38:46 +08:00
zky.zhoukeyong	27521547f0	[CELEBORN-823][DOC] Add Celeborn architecture document ### What changes were proposed in this pull request? As title. ### Why are the changes needed? As title. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? No. Closes #1746 from waitinfuture/823. Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-22 23:57:22 +08:00
zky.zhoukeyong	fb2af146bf	[CELEBORN-822][DOC] Add quick start guide ### What changes were proposed in this pull request? As title. ![image](https://github.com/apache/incubator-celeborn/assets/948245/e2e96131-26be-497f-9f11-e8b5e215a15d) ### Why are the changes needed? As title. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? No. Closes #1745 from waitinfuture/822. Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Co-authored-by: Keyong Zhou <waitinfuture@gmail.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-22 21:39:41 +08:00
Angerszhuuuu	f15c2a7a68	[CELEBORN-814] Merge upgrade doc to Deployment tab and add TOC ### What changes were proposed in this pull request? As title <img width="1643" alt="Screenshot 2023-07-20 12 01 06 PM" src="https://github.com/apache/incubator-celeborn/assets/46485123/d8822003-602f-4fe8-9634-ff25c0367cb1"> ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1738 from AngersZhuuuu/CELEBORN-814. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>	2023-07-20 14:06:12 +08:00
zky.zhoukeyong	6a5e3ed794	[CELEBORN-812] Cleanup SendBufferPool if idle for long ### What changes were proposed in this pull request? Cleans up the pooled send buffers and push tasks if the SendBufferPool has been idle for more than `celeborn.client.push.sendbufferpool.expireTimeout`. ### Why are the changes needed? Before this PR the SendBufferPool will cache the send buffers and push tasks forever. If they are large and will not be reused in the future, it wastes memory and causes GC. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Passes GA and manual tests. Closes #1735 from waitinfuture/812-1. Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-20 00:34:55 +08:00
Angerszhuuuu	14c6e5719f	[CELEBORN-811] Refine monitoring doc ### What changes were proposed in this pull request? Refine monitoring doc 1. Remove unnecessary left side navigator 2. Add TOC in right side 3. fix list indentation Before ![celeborn apache org_docs_latest_monitoring_](https://github.com/apache/incubator-celeborn/assets/46485123/885da0e5-f2f9-41ba-a9fe-257e46e76a78) After ![127 0 0 1_8000_monitoring_](https://github.com/apache/incubator-celeborn/assets/46485123/8cb3fc60-0a2e-4134-8edb-dd0fe434be60) ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1734 from AngersZhuuuu/CELEBORN-811. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-07-19 20:53:21 +08:00
onebox-li	405b2801fa	[CELEBORN-810] Fix some typos and grammar ### What changes were proposed in this pull request? Fix some typos and grammar ### Why are the changes needed? Ditto ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? manually test Closes #1733 from onebox-li/fix-typo. Authored-by: onebox-li <lyh-36@163.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-19 18:35:38 +08:00
Cheng Pan	0db919403e	Revert "[CELEBORN-798] Add heartbeat from client to LifecycleManager to clean…" This reverts commit `e56a8a8bed`.	2023-07-19 15:08:45 +08:00
zky.zhoukeyong	1109e2c8f4	[CELEBORN-803][FOLLOWUP] Make ```rpcAskTimeout``` default to 60s ### What changes were proposed in this pull request? As title. ### Why are the changes needed? Timeout of ```RpcEndpointRef.ask``` is controlled by ```celeborn.rpc.askTimeout```, so we also need to increase ```celeborn.rpc.askTimeout``` to extend the timeout of commit files. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Passes GA and manual test. Closes #1725 from waitinfuture/803-fu. Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-17 23:53:52 +08:00
zky.zhoukeyong	9ec223edd7	[CELEBORN-803] Increase default timeout for commit files ### What changes were proposed in this pull request? As title. ### Why are the changes needed? In 0.2.1-incubating, the default timeout for commit files is ```NETWORK_TIMEOUT```, which is 240s. That is more reasonable because committing files takes a relatively long time. In my testing with tough disks, a 30s timeout with 2 retries is not enough. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Passes GA and manual test. Closes #1724 from waitinfuture/803. Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-17 22:31:36 +08:00
zky.zhoukeyong	e56a8a8bed	[CELEBORN-798] Add heartbeat from client to LifecycleManager to clean… …up client ### What changes were proposed in this pull request? Add a heartbeat from the client to the lifecycle manager. In this PR the heartbeat request contains the client's local shuffle ids; the lifecycle manager checks them against its local set and returns the ids it doesn't know. Upon receiving the response, the client calls ```unregisterShuffle``` for cleanup. ### Why are the changes needed? Before this PR, client side ```unregisterShuffle``` is never called. When running TPCDS 3T with spark thriftserver without DRA, I found the Executor's heap contains 1.6 million PartitionLocation objects (and StorageInfo): ![image](https://github.com/apache/incubator-celeborn/assets/948245/43658369-7763-4511-a5b0-9b3fbdf02005) After this PR, the number of PartitionLocation objects decreases to 275 thousand ![image](https://github.com/apache/incubator-celeborn/assets/948245/45f8f849-186d-4cad-83c8-64bd6d18debc) This heartbeat can be extended in the future for other purposes, e.g. reporting the client's metrics. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Passes GA and manual test. Closes #1719 from waitinfuture/798. Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-17 18:14:10 +08:00
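The heartbeat check described in CELEBORN-798 is essentially a set difference. The sketch below is a hedged illustration, not the PR's code: `ShuffleHeartbeat` and `unknownShuffleIds` are hypothetical names, and shuffle ids are assumed to be integers.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration of the lifecycle-manager side of the heartbeat.
final class ShuffleHeartbeat {
  // Returns the ids the client still holds but the lifecycle manager no longer knows.
  // The client would then call unregisterShuffle for each returned id.
  static Set<Integer> unknownShuffleIds(Set<Integer> clientIds, Set<Integer> managerIds) {
    Set<Integer> unknown = new HashSet<>(clientIds); // copy so the inputs stay untouched
    unknown.removeAll(managerIds);
    return unknown;
  }
}
```

Returning only the unknown ids keeps the response small when most client-side shuffles are still live.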
zky.zhoukeyong	95119b1e4b	[CELEBORN-799][FOLLOWUP] Fix doc of `celeborn.client.push.maxReqsInFlight.total` ### What changes were proposed in this pull request? Refer to https://github.com/apache/incubator-celeborn/pull/1720#discussion_r1265092164 ### Why are the changes needed? Ditto. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Passes GA. Closes #1723 from waitinfuture/799-fu. Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-17 18:01:03 +08:00
zky.zhoukeyong	4b3a47c9db	[CELEBORN-799] Limit total inflight push requests ### What changes were proposed in this pull request? As title. ### Why are the changes needed? When the number of worker instances is very large, say 1000, the total memory consumed by inflight requests before this PR is 64K * 1000 * ```celeborn.client.push.maxReqsInFlight(16)``` = 1G. This PR limits the total number of inflight push requests, as 0.2.1-incubating does. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Passes GA and manual test. Closes #1720 from waitinfuture/799. Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Co-authored-by: Cheng Pan <pan3793@gmail.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-17 16:17:24 +08:00
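The memory estimate in CELEBORN-799 can be checked with a few lines of Java. The class and method names are hypothetical; the 64K request size, 1000 workers, and 16 inflight requests per worker are the figures quoted in the commit message, with 64K read as 64 * 1024 bytes (an assumption).

```java
// Hypothetical helper that reproduces the arithmetic from the commit message.
final class InflightMemory {
  // Worst-case bytes held by inflight push requests when the limit is per worker.
  static long totalInflightBytes(long bytesPerRequest, long workers, long maxReqsInFlightPerWorker) {
    return bytesPerRequest * workers * maxReqsInFlightPerWorker;
  }

  public static void main(String[] args) {
    long bytes = totalInflightBytes(64L * 1024, 1000, 16);
    // 65,536 * 1000 * 16 = 1,048,576,000 bytes, i.e. roughly 1G as stated above.
    System.out.println(bytes + " bytes");
  }
}
```

Because the product scales linearly with the number of workers, a per-worker limit alone cannot bound client memory, which is why a total limit is introduced.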
zky.zhoukeyong	a7bbbd05c4	[CELEBORN-797] Decrease writeTime metric sampling frequency to improve perf ### What changes were proposed in this pull request? 1. Decrease writeTime metric sampling frequency to improve perf 2. Set default value of ```celeborn.<module>.push.timeoutCheck.threads``` and ```celeborn.<module>.fetch.timeoutCheck.threads``` to 4 ### Why are the changes needed? Following are test cases case 1: ```spark.sparkContext.parallelize(1 to 8000, 8000).flatMap( _ => (1 to 15000000).iterator.map(num => num)).repartition(8000).count``` // shuffle 1.1T data case 2: ```spark.sparkContext.parallelize(1 to 8000, 8000).flatMap( _ => (1 to 30000000).iterator.map(num => num)).repartition(8000).count``` // shuffle 2.2T data Following are e2e time of shuffle write stage \|\|Sort pusher before\|Sort pusher after\|Hash pusher before\|Hash pusher after\| \|----\|----\|----\|----\|-----\| \|case1\|4.4min\|4.1min\|4.4min\|3.9min\| \|case2\|9.1min\|8.4min\|9.7min\|8.5min\| ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Passes GA and manual test. Closes #1718 from waitinfuture/797. Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-14 20:51:50 +08:00

203 Commits