celeborn

Author	SHA1	Message	Date
zhaohehuhu	a2d3972318	[CELEBORN-1530] support MPU for S3 ### What changes were proposed in this pull request? as title ### Why are the changes needed? AWS S3 doesn't support append, so Celeborn had to copy the historical data from S3 to the worker and write it to S3 again, which heavily amplifies writes. This PR implements a better solution via MPU (multipart upload) to avoid copy-and-write. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ![WechatIMG257](https://github.com/user-attachments/assets/968d9162-e690-4767-8bed-e490e3055753) I conducted an experiment with a 1GB input dataset to compare the performance of Celeborn using only S3 storage versus using SSD storage. The results showed that Celeborn with SSD storage was approximately three times faster than with only S3 storage. <img width="1728" alt="Screenshot 2024-11-16 at 13 02 10" src="https://github.com/user-attachments/assets/8f879c47-c01a-4004-9eae-1c266c1f3ef2"> The above screenshot is from the second test I ran, with 5000 mappers and reducers. Closes #2830 from zhaohehuhu/dev-1021. Lead-authored-by: zhaohehuhu <luoyedeyi@163.com> Co-authored-by: He Zhao <luoyedeyi459@163.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>	2024-11-22 15:03:53 +08:00
Wang, Fei	ea6617c0d5	[CELEBORN-1521] Introduce celeborn-spi module for authentication extensions ### What changes were proposed in this pull request? Introduce celeborn-spi module for authentication extensions. ### Why are the changes needed? Address comments: https://github.com/apache/celeborn/pull/2632#issuecomment-2247132115 ### Does this PR introduce _any_ user-facing change? No, this interface has not been released. ### How was this patch tested? UT. Closes #2644 from turboFei/celeborn_spi. Authored-by: Wang, Fei <fwang12@ebay.com> Signed-off-by: Wang, Fei <fwang12@ebay.com>	2024-07-25 00:52:00 -07:00
zhaohehuhu	7a596bbed1	[CELEBORN-1469] Support writing shuffle data to OSS(S3 only) ### What changes were proposed in this pull request? as title ### Why are the changes needed? Now, Celeborn doesn't support sinking shuffle data directly to Amazon S3, which could be a limitation when we're trying to move on-premises servers to AWS and use S3 as a data sink for shuffled data. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Closes #2579 from zhaohehuhu/dev-0619. Authored-by: zhaohehuhu <luoyedeyi@163.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>	2024-07-24 11:59:15 +08:00
SteNicholas	e5f09ce4e0	[CELEBORN-1443] Remove ratis dependencies from common module ### What changes were proposed in this pull request? Remove ratis dependencies from common module. ### Why are the changes needed? Ratis is only depended on by the master module. Removing ratis dependencies from the common module reduces the size of the Celeborn client package. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? GA. Closes #2538 from SteNicholas/CELEBORN-1443. Authored-by: SteNicholas <programgeek@163.com> Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>	2024-06-03 10:15:51 +08:00
SteNicholas	dd87419044	[CELEBORN-1380][FOLLOWUP] leveldbjni uses org.openlabtesting.leveldbjni to support linux aarch64 platform for leveldb via aarch64 profile ### What changes were proposed in this pull request? Dependency leveldbjni uses `org.openlabtesting.leveldbjni` to support linux aarch64 platform for leveldb via `aarch64` profile. Follow up #2476. ### Why are the changes needed? A Celeborn worker could not start on ARM devices if the DB backend is `LevelDB`, so leveldbjni should be supported on the aarch64 platform. aarch64 uses `org.openlabtesting.leveldbjni:leveldbjni-all.1.8`, and other platforms use `org.fusesource.leveldbjni:leveldbjni-all.1.8`. Meanwhile, some Hadoop dependency packages also depend on `org.fusesource.leveldbjni:leveldbjni-all`, and Hadoop merged a similar change on trunk (see [HADOOP-16614](https://issues.apache.org/jira/browse/HADOOP-16614)), so the `org.fusesource.leveldbjni` dependency should be excluded from the related Hadoop packages. In addition, `org.openlabtesting.leveldbjni` requires glibc version 3.4.21. Otherwise, there will be the following potential runtime risks: ``` # # A fatal error has been detected by the Java Runtime Environment: # # SIGBUS (0x7) at pc=0x00007fad3630b12a, pid=62, tid=0x00007f93394ef700 # # JRE version: Java(TM) SE Runtime Environment (8.0_162-b12) (build 1.8.0_162-b12) # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.162-b12 mixed mode linux-amd64 ) # Problematic frame: # C [libc.so.6+0x8412a] # # Core dump written. Default location: /data/service/celeborn/core or core.62 # # If you would like to submit a bug report, please visit: # http://bugreport.java.com/bugreport/crash.jsp # The crash happened outside the Java Virtual Machine in native code. # See problematic frame for where to report the bug. 
# --------------- T H R E A D --------------- Current thread (0x00007f9308001000): JavaThread "leveldb" [_thread_in_native, id=878, stack(0x00007f9338cf0000,0x00007f93394f0000)] siginfo: si_signo: 7 (SIGBUS), si_code: 2 (BUS_ADRERR), si_addr: 0x00007f97380d2220 ``` Backport: - https://github.com/apache/spark/pull/26636 - https://github.com/apache/spark/pull/31036 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? No. Closes #2530 from SteNicholas/CELEBORN-1380. Authored-by: SteNicholas <programgeek@163.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>	2024-05-27 14:07:02 +08:00
SteNicholas	9110eab996	[CELEBORN-1380] leveldbjni uses org.openlabtesting.leveldbjni to support linux aarch64 platform for leveldb ### What changes were proposed in this pull request? Dependency leveldbjni uses `org.openlabtesting.leveldbjni` to support linux aarch64 platform for leveldb. ### Why are the changes needed? A Celeborn worker could not start on ARM devices if the DB backend is `LevelDB`, so leveldbjni should be supported on the aarch64 platform. aarch64 uses `org.openlabtesting.leveldbjni:leveldbjni-all.1.8`, and other platforms use `org.fusesource.leveldbjni:leveldbjni-all.1.8`. Meanwhile, some Hadoop dependency packages also depend on `org.fusesource.leveldbjni:leveldbjni-all`, and Hadoop merged a similar change on trunk (see [HADOOP-16614](https://issues.apache.org/jira/browse/HADOOP-16614)), so the `org.fusesource.leveldbjni` dependency should be excluded from the related Hadoop packages. Backport: - https://github.com/apache/spark/pull/26636 - https://github.com/apache/spark/pull/31036 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? No. Closes #2476 from SteNicholas/CELEBORN-1380. Authored-by: SteNicholas <programgeek@163.com> Signed-off-by: Cheng Pan <chengpan@apache.org>	2024-04-24 11:52:56 +08:00
Mridul Muralidharan	4400089708	[CELEBORN-1346] Add build changes and test resources for ssl support ### What changes were proposed in this pull request? Build changes and test resources for enabling SSL support. Please see #2416 for the consolidated PR with all the changes for reference. Note: I closed the older PR #2413 and reopened this one given the repo changes. ### Why are the changes needed? Build dependency updates and addition of test resources for use with tests. The specific tests leveraging these will be added in subsequent jiras linked off of CELEBORN-1343. Splitting it up into multiple PRs to reduce the review load. ### Does this PR introduce _any_ user-facing change? io.netty:netty-tcnative-boringssl-static is an additional dependency. org.bouncycastle:* are test dependencies which should have no user facing changes. ### How was this patch tested? The overall PR #2411 passes all tests, this is specifically pulling out the dependency changes and resources. Closes #2417 from mridulm/build-and-test-for-tls. Lead-authored-by: Mridul Muralidharan <mridul@gmail.com> Co-authored-by: Mridul Muralidharan <mridulatgmail.com> Signed-off-by: Cheng Pan <chengpan@apache.org>	2024-03-26 21:50:54 +08:00
SteNicholas	d62f75fdc7	[MINOR] Unify license format of pom.xml ### What changes were proposed in this pull request? Unify license format of `pom.xml`. ### Why are the changes needed? License formats differ among modules; the standard license format has an indent before `~`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? No. Closes #2408 from SteNicholas/maven-license-format. Authored-by: SteNicholas <programgeek@163.com> Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>	2024-03-21 14:34:49 +08:00
sychen	2504b50dd2	[CELEBORN-1170] Upgrade snappy-java from 1.1.8.2 to 1.1.10.5 ### What changes were proposed in this pull request? ### Why are the changes needed? https://github.com/apache/incubator-celeborn/pull/2143 The snappy-java 1.1.8.2 version has the following CVE vulnerabilities, see https://scout.docker.com/vulnerabilities/id/CVE-2023-43642 https://scout.docker.com/vulnerabilities/id/CVE-2023-34455 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #2158 from cxzl25/CELEBORN-1170. Authored-by: sychen <sychen@ctrip.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-12-14 22:28:32 +08:00
qinrui	04a1e90207	[CELEBORN-1122] Metrics supports json format ### What changes were proposed in this pull request? If the user does not use Prometheus to collect monitoring metrics but some other system, metrics in JSON format would be more user-friendly. This PR supports JSON format for metrics. ### Why are the changes needed? Ditto. ### Does this PR introduce _any_ user-facing change? Metrics supports JSON format ### How was this patch tested? Cluster test. Closes #2089 from suizhe007/CELEBORN-1122. Authored-by: qinrui <qr7972@gmail.com> Signed-off-by: Shuang <lvshuang.tb@gmail.com>	2023-12-06 09:24:28 +08:00
Mridul Muralidharan	3a41db360b	[CELEBORN-1006] Add support for Apache Hadoop 2.x in Celeborn build Developers need to only specify their `hadoop.version`, and the build will pick the right profile internally based on the version to add the relevant dependencies. [hadoop-client-api](https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client-api) and [hadoop-client-runtime](https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client-runtime) were introduced in hadoop 3.x, while hadoop 2.x had [hadoop-client](https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client). Celeborn depends on the former, and so requires hadoop 3.x to build. Apache Spark dropped support for Hadoop 2.x only in the recent v3.5 ([SPARK-42452](https://issues.apache.org/jira/browse/SPARK-42452)). Given this, we have cases where deployments on supported platforms, like Spark 3.4 and older running on hadoop 2.x, will need to pull in hadoop 3.x just for Celeborn. This PR uses `hadoop-client` when `hadoop.version` is specified as 2.x and preserves existing behavior when `hadoop.version` is 3.x. Note: while using `hadoop-client` in 3.x is an option, the hadoop community recommendation is to rely on `hadoop-client-api`/`hadoop-client-runtime`, hence making an effort to leverage that as much as possible. 
Adds support for using 2.x for hadoop.version Three combinations were tested: * Default, without overriding hadoop.version Dependencies: ``` $ build/mvn dependency:list 2>&1 | grep hadoop | sort | uniq [INFO] org.apache.hadoop:hadoop-client-api:jar:3.2.4:compile [INFO] org.apache.hadoop:hadoop-client-runtime:jar:3.2.4:compile ``` Will update this section again based on test suite results (which are ongoing) * Setting hadoop.version to newer 3.3.0 explicitly Dependencies: ``` $ ARGS="-Pspark-3.1 -Dhadoop.version=3.3.0" ; build/mvn dependency:list $ARGS 2>&1 | grep hadoop | sort | uniq [INFO] org.apache.hadoop:hadoop-client-api:jar:3.3.0:compile [INFO] org.apache.hadoop:hadoop-client-runtime:jar:3.3.0:compile ``` * Setting hadoop.version to older 2.10.0 Dependencies: ``` $ ARGS="-Pspark-3.1 -Dhadoop.version=2.10.0" ; build/mvn dependency:list $ARGS 2>&1 | grep hadoop | grep compile | sort | uniq [INFO] org.apache.hadoop:hadoop-auth:jar:2.10.0:compile -- module hadoop.auth (auto) [INFO] org.apache.hadoop:hadoop-client:jar:2.10.0:compile -- module hadoop.client (auto) [INFO] org.apache.hadoop:hadoop-common:jar:2.10.0:compile -- module hadoop.common (auto) [INFO] org.apache.hadoop:hadoop-hdfs-client:jar:2.10.0:compile -- module hadoop.hdfs.client (auto) [INFO] org.apache.hadoop:hadoop-mapreduce-client-app:jar:2.10.0:compile -- module hadoop.mapreduce.client.app (auto) [INFO] org.apache.hadoop:hadoop-mapreduce-client-common:jar:2.10.0:compile -- module hadoop.mapreduce.client.common (auto) [INFO] org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.10.0:compile -- module hadoop.mapreduce.client.core (auto) [INFO] org.apache.hadoop:hadoop-mapreduce-client-jobclient:jar:2.10.0:compile [INFO] org.apache.hadoop:hadoop-mapreduce-client-shuffle:jar:2.10.0:compile -- module hadoop.mapreduce.client.shuffle (auto) [INFO] org.apache.hadoop:hadoop-yarn-api:jar:2.10.0:compile -- module hadoop.yarn.api (auto) [INFO] org.apache.hadoop:hadoop-yarn-common:jar:2.10.0:compile -- 
module hadoop.yarn.common (auto) ``` For each of the cases above, build/test passes with the corresponding `ARGS`. Closes #1936 from mridulm/main. Authored-by: Mridul Muralidharan <mridulatgmail.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-09-25 20:15:02 +08:00
mingji	e0c00ecd38	[CELEBORN-839][MR] Support Hadoop MapReduce ### What changes were proposed in this pull request? 1. Map side merge and push. 2. Support hadoop2 & 3. 3. Reduce in-memory merge. 4. Integrate LifecycleManager to RmApplicationMaster. ### Why are the changes needed? Ditto. ### Does this PR introduce _any_ user-facing change? NO. ### How was this patch tested? Cluster. I tested this PR on a cluster with a 4x 16 CPU 64G Mem 4ESSD cluster. Hadoop 2.8.5 1TB Terasort, 8400 mappers, 1000 reducers Celeborn 81min vs MR shuffle 89min ![mr1](https://github.com/apache/incubator-celeborn/assets/4150993/a3cf6493-b6ff-4c03-9936-4558cf22761d) ![mr2](https://github.com/apache/incubator-celeborn/assets/4150993/9119ffb4-6996-4b77-bcdf-cbd6db5c096f) 1GB wordcount, 8 mappers, 8 reducers Celeborn 35s VS MR shuffle 38s ![mr3](https://github.com/apache/incubator-celeborn/assets/4150993/907dce24-16b7-4788-ab5d-5b784fd07d47) ![mr4](https://github.com/apache/incubator-celeborn/assets/4150993/8e8065b9-6c46-4c8d-9e71-45eed8e63877) Closes #1830 from FMX/CELEBORN-839. Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com> Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-09-14 14:12:53 +08:00
Fu Chen	3fb896b11f	[CELEBORN-666] Define `protobuf-maven-plugin` in the root pom.xml ### What changes were proposed in this pull request? Define `protobuf-maven-plugin` in the root pom.xml ### Why are the changes needed? to fix ```bash build/mvn protobuf:compile -am -pl common ``` ``` [ERROR] No plugin found for prefix 'protobuf' in the current project and in the plugin groups [org.apache.maven.plugins, org.codehaus.mojo] available from the repositories [local (/Users/fchen/.m2/repository), apache.snapshots (https://repository.apache.org/snapshots), central (https://repo.maven.apache.org/maven2)] -> [Help 1] org.apache.maven.plugin.prefix.NoPluginFoundForPrefixException: No plugin found for prefix 'protobuf' in the current project and in the plugin groups [org.apache.maven.plugins, org.codehaus.mojo] available from the repositories [local (/Users/fchen/.m2/repository), apache.snapshots (https://repository.apache.org/snapshots), central (https://repo.maven.apache.org/maven2)] at org.apache.maven.plugin.prefix.internal.DefaultPluginPrefixResolver.resolve (DefaultPluginPrefixResolver.java:95) at org.apache.maven.lifecycle.internal.MojoDescriptorCreator.findPluginForPrefix (MojoDescriptorCreator.java:266) at org.apache.maven.lifecycle.internal.MojoDescriptorCreator.getMojoDescriptor (MojoDescriptorCreator.java:220) at org.apache.maven.lifecycle.internal.DefaultLifecycleTaskSegmentCalculator.calculateTaskSegments (DefaultLifecycleTaskSegmentCalculator.java:104) at org.apache.maven.lifecycle.internal.DefaultLifecycleTaskSegmentCalculator.calculateTaskSegments (DefaultLifecycleTaskSegmentCalculator.java:83) at org.apache.maven.lifecycle.internal.LifecycleStarter.execute (LifecycleStarter.java:89) at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:298) at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:192) at org.apache.maven.DefaultMaven.execute (DefaultMaven.java:105) at org.apache.maven.cli.MavenCli.execute (MavenCli.java:960) at 
org.apache.maven.cli.MavenCli.doMain (MavenCli.java:293) at org.apache.maven.cli.MavenCli.main (MavenCli.java:196) at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke (Method.java:498) at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced (Launcher.java:282) at org.codehaus.plexus.classworlds.launcher.Launcher.launch (Launcher.java:225) at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode (Launcher.java:406) at org.codehaus.plexus.classworlds.launcher.Launcher.main (Launcher.java:347) [ERROR] [ERROR] [ERROR] For more information about the errors and possible solutions, please read the following articles: [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/NoPluginFoundForPrefixException ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? tested locally. Closes #1579 from cfmcgrady/protobuf-plugin. Authored-by: Fu Chen <cfmcgrady@gmail.com> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-06-12 19:46:46 +08:00
Fu Chen	ab449ffdd7	[CELEBORN-198] Fix the wrong configuration path of plugin protobuf-maven-plugin and … (#1146 )	2023-01-05 20:09:31 +08:00
Ethan Feng	dd02070e4b	[CELEBORN-83] Fix various bugs when using HDFS as storage. 1. Fix incompatibility between Hadoop 2 and Hadoop 3. 2. Fix the HDFS writer never being called when there are no healthy disks. 3. Fix an NPE when the HDFS file writer closes.	2022-11-30 19:33:18 +08:00
Cheng Pan	96e969f46e	[BUILD] Extract project.version to Maven Property (#772 )	2022-10-16 19:01:40 +08:00
Cheng Pan	ab16b4f101	[INFRA] Rename modules w/ celeborn prefix (#723 )	2022-10-08 08:05:57 +08:00
Keyong Zhou	a2d2379153	[DOC] Replace RSS with Celeborn in docs (#715 )	2022-10-06 10:37:46 +08:00
Cheng Pan	4880d78d6a	Extract spark tests and improve pom (#711 )	2022-10-04 10:23:26 +08:00
Keyong Zhou	fe3b5988f2	[REFACTOR] Change package name to org.apache.celeborn (#710 )	2022-10-02 18:10:29 +08:00
AngersZhuuuu	343caba83c	[ISSUE-656][FEATURE] Support get user quota from quota conf setting (#659 ) [ISSUE-656][FEATURE] Support get user quota from rssConf setting	2022-09-29 12:55:01 +08:00
Ethan Feng	b4654d788c	[ISSUE-607]Add map ids info for each PartitionLocation to enable filtering for m… (#619 )	2022-09-23 15:21:41 +08:00
Cheng Pan	4b42219595	Remove log4j1 (#501 )	2022-09-05 19:30:15 +08:00
nafiy	6d308eb4f2	[ISSUE-465][Bug] Common module scalatest style unit tests don't actually run (#472 )	2022-08-28 18:52:39 +08:00
Ethan Feng	a4bab91453	[issue-332] support flush disk buffer to hdfs (#430 )	2022-08-23 21:04:45 +08:00
Cheng Pan	f1f4b894af	Build: Enhance build system (#349 )	2022-08-15 14:59:01 +08:00
Cheng Pan	d01ee81ee6	Bump Ratis 2.3.0 and related toolchains (#299 )	2022-08-04 21:59:42 +08:00
AngersZhuuuu	fe17914942	Refactor pom import issue (#277 )	2022-07-25 17:49:55 +08:00
mingji	d4d8eb3838	update pom version.	2022-06-24 14:28:42 +08:00
Ethan Feng	1113f437c6	[FEATURE] Remove dependency on spark-tags from common module (#126 ) (#128 )	2022-05-31 15:24:08 +08:00
nafiy	491f89bbb5	[FEATURE]Add metrics source for JVM and CPU (#125 ) * Add metrics source for JVM and CPU * Fix scala style issue	2022-05-30 13:26:54 +08:00
Ethan Feng	86adc0d244	[Feature]Add metrics documentation and grafana dashboard. (#117 )	2022-05-20 12:12:41 +08:00
Ethan Feng	baa2836216	Add metrics: (#85 ) 1.shuffle fetch send data time. 2.open stream time. 3.memory critical count.	2022-04-02 15:05:27 +08:00
Ethan Feng	9ad8254b0a	AQE support. (#67 )	2022-04-01 20:19:01 +08:00
wangshengjie123	b2a6091b55	[Feature] Make log4j2 optional so that we can update log4j2.xml to change the log level (#56 )	2022-03-08 22:33:06 +08:00
Ethan Feng	356a1952e4	Multi Client Support (#47 ) Co-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2022-01-29 22:28:06 +08:00
zky.zhoukeyong	ba5920acde	Initial Commit for RSS	2021-12-28 20:57:35 +08:00

37 Commits