37 Commits

Author SHA1 Message Date
zhaohehuhu
a2d3972318 [CELEBORN-1530] support MPU for S3
### What changes were proposed in this pull request?

as title

### Why are the changes needed?
AWS S3 doesn't support append, so Celeborn had to copy the historical data from S3 to the worker and write it to S3 again, which heavily amplifies writes. This PR implements a better solution via multipart upload (MPU) to avoid the copy-and-write.
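
To make the write amplification concrete, here is a minimal back-of-the-envelope model in Python. It is an illustration only, not Celeborn's implementation: the function names are hypothetical, and it assumes that every flush has the same size and that, without append, each flush re-uploads all data accumulated so far.

```python
def bytes_written_copy_and_rewrite(num_flushes: int, flush_size: int) -> int:
    """Total bytes uploaded when every flush rewrites all earlier data.

    Flush k uploads the k chunks accumulated so far, so the total is
    flush_size * (1 + 2 + ... + num_flushes).
    """
    return flush_size * num_flushes * (num_flushes + 1) // 2


def bytes_written_multipart(num_flushes: int, flush_size: int) -> int:
    """Total bytes uploaded when each flush is sent once as its own part."""
    return flush_size * num_flushes


if __name__ == "__main__":
    # Hypothetical workload: 100 flushes of 8 MiB each.
    flushes, size = 100, 8 * 1024 * 1024
    naive = bytes_written_copy_and_rewrite(flushes, size)
    mpu = bytes_written_multipart(flushes, size)
    print(f"copy-and-rewrite: {naive} bytes, multipart: {mpu} bytes")
```

Under these assumptions the rewrite approach grows quadratically with the number of flushes, while the multipart approach grows linearly.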

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

![WechatIMG257](https://github.com/user-attachments/assets/968d9162-e690-4767-8bed-e490e3055753)

I conducted an experiment with a 1GB input dataset to compare the performance of Celeborn using only S3 storage versus using SSD storage. The results showed that Celeborn with SSD storage was approximately three times faster than with only S3 storage.

<img width="1728" alt="Screenshot 2024-11-16 at 13 02 10" src="https://github.com/user-attachments/assets/8f879c47-c01a-4004-9eae-1c266c1f3ef2">

The screenshot above is from my second test, which used 5000 mappers and reducers.

Closes #2830 from zhaohehuhu/dev-1021.

Lead-authored-by: zhaohehuhu <luoyedeyi@163.com>
Co-authored-by: He Zhao <luoyedeyi459@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-11-22 15:03:53 +08:00
Wang, Fei
ea6617c0d5 [CELEBORN-1521] Introduce celeborn-spi module for authentication extensions
### What changes were proposed in this pull request?
Introduce celeborn-spi module for authentication extensions.

### Why are the changes needed?
Address comments: https://github.com/apache/celeborn/pull/2632#issuecomment-2247132115

### Does this PR introduce _any_ user-facing change?
No, this interface has not been released.

### How was this patch tested?

UT.

Closes #2644 from turboFei/celeborn_spi.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
2024-07-25 00:52:00 -07:00
zhaohehuhu
7a596bbed1 [CELEBORN-1469] Support writing shuffle data to OSS(S3 only)
### What changes were proposed in this pull request?

as title

### Why are the changes needed?

Celeborn currently doesn't support sinking shuffle data directly to Amazon S3, which is a limitation when moving on-premises servers to AWS and using S3 as a data sink for shuffle data.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Closes #2579 from zhaohehuhu/dev-0619.

Authored-by: zhaohehuhu <luoyedeyi@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-07-24 11:59:15 +08:00
SteNicholas
e5f09ce4e0 [CELEBORN-1443] Remove ratis dependencies from common module
### What changes were proposed in this pull request?

Remove ratis dependencies from common module.

### Why are the changes needed?

Ratis is only depended on by the master module. Removing ratis dependencies from the common module reduces the size of the Celeborn client package.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GA.

Closes #2538 from SteNicholas/CELEBORN-1443.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-06-03 10:15:51 +08:00
SteNicholas
dd87419044
[CELEBORN-1380][FOLLOWUP] leveldbjni uses org.openlabtesting.leveldbjni to support linux aarch64 platform for leveldb via aarch64 profile
### What changes were proposed in this pull request?

Dependency leveldbjni uses `org.openlabtesting.leveldbjni` to support linux aarch64 platform for leveldb via `aarch64` profile.

Follow up #2476.

### Why are the changes needed?

The Celeborn worker could not start on ARM devices when the DB backend is `LevelDB`, so leveldbjni should be supported on the aarch64 platform.

aarch64 uses `org.openlabtesting.leveldbjni:leveldbjni-all:1.8`, and other platforms use `org.fusesource.leveldbjni:leveldbjni-all:1.8`. Some Hadoop dependencies also depend on `org.fusesource.leveldbjni:leveldbjni-all`; Hadoop has merged a similar change on trunk (see [HADOOP-16614](https://issues.apache.org/jira/browse/HADOOP-16614)), so the `org.fusesource.leveldbjni` dependency should be excluded from those Hadoop packages.

In addition, `org.openlabtesting.leveldbjni` requires glibc version 3.4.21. Otherwise, the JVM may crash at runtime with an error like the following:

```
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGBUS (0x7) at pc=0x00007fad3630b12a, pid=62, tid=0x00007f93394ef700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_162-b12) (build 1.8.0_162-b12)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.162-b12 mixed mode linux-amd64 )
# Problematic frame:
# C  [libc.so.6+0x8412a]
#
# Core dump written. Default location: /data/service/celeborn/core or core.62
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

---------------  T H R E A D  ---------------

Current thread (0x00007f9308001000):  JavaThread "leveldb" [_thread_in_native, id=878, stack(0x00007f9338cf0000,0x00007f93394f0000)]

siginfo: si_signo: 7 (SIGBUS), si_code: 2 (BUS_ADRERR), si_addr: 0x00007f97380d2220
```

Backport:

- https://github.com/apache/spark/pull/26636
- https://github.com/apache/spark/pull/31036

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2530 from SteNicholas/CELEBORN-1380.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-05-27 14:07:02 +08:00
SteNicholas
9110eab996
[CELEBORN-1380] leveldbjni uses org.openlabtesting.leveldbjni to support linux aarch64 platform for leveldb
### What changes were proposed in this pull request?

Dependency leveldbjni uses `org.openlabtesting.leveldbjni` to support linux aarch64 platform for leveldb.

### Why are the changes needed?

The Celeborn worker could not start on ARM devices when the DB backend is `LevelDB`, so leveldbjni should be supported on the aarch64 platform.

aarch64 uses `org.openlabtesting.leveldbjni:leveldbjni-all:1.8`, and other platforms use `org.fusesource.leveldbjni:leveldbjni-all:1.8`. Some Hadoop dependencies also depend on `org.fusesource.leveldbjni:leveldbjni-all`; Hadoop has merged a similar change on trunk (see [HADOOP-16614](https://issues.apache.org/jira/browse/HADOOP-16614)), so the `org.fusesource.leveldbjni` dependency should be excluded from those Hadoop packages.
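
The per-architecture choice described above can be sketched in a few lines of Python. This is only an illustration of the selection rule, not how the build does it (the build uses Maven configuration); the function name is hypothetical, and the group IDs are the ones named in this message.

```python
import platform

# Group IDs as named in the commit message; everything else is illustrative.
AARCH64_GROUP = "org.openlabtesting.leveldbjni"
DEFAULT_GROUP = "org.fusesource.leveldbjni"


def leveldbjni_coordinates(arch: str) -> str:
    """Return the Maven coordinates of leveldbjni-all for a CPU architecture."""
    group = AARCH64_GROUP if arch.lower() == "aarch64" else DEFAULT_GROUP
    return f"{group}:leveldbjni-all:1.8"


if __name__ == "__main__":
    # platform.machine() reports the architecture of the current host.
    print(leveldbjni_coordinates(platform.machine()))
```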

Backport:

- https://github.com/apache/spark/pull/26636
- https://github.com/apache/spark/pull/31036

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2476 from SteNicholas/CELEBORN-1380.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2024-04-24 11:52:56 +08:00
Mridul Muralidharan
4400089708
[CELEBORN-1346] Add build changes and test resources for ssl support
### What changes were proposed in this pull request?

Build changes and test resources for enabling SSL support.
Please see #2416 for the consolidated PR with all the changes for reference.

Note: I closed the older PR #2413 and reopened this one given the repo changes.

### Why are the changes needed?

Build dependency updates and addition of test resources for use with tests.
The specific tests leveraging these will be added in subsequent JIRAs linked from CELEBORN-1343.
This work is split into multiple PRs to reduce the review load.

### Does this PR introduce _any_ user-facing change?

io.netty:netty-tcnative-boringssl-static is an additional dependency.
org.bouncycastle:* are test dependencies which should have no user facing changes.

### How was this patch tested?
The overall PR #2411 passes all tests; this PR specifically pulls out the dependency changes and resources.

Closes #2417 from mridulm/build-and-test-for-tls.

Lead-authored-by: Mridul Muralidharan <mridul@gmail.com>
Co-authored-by: Mridul Muralidharan <mridulatgmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2024-03-26 21:50:54 +08:00
SteNicholas
d62f75fdc7 [MINOR] Unify license format of pom.xml
### What changes were proposed in this pull request?

Unify license format of `pom.xml`.

### Why are the changes needed?

License header formats differ among modules; the standard format has an indent before `~`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2408 from SteNicholas/maven-license-format.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-03-21 14:34:49 +08:00
sychen
2504b50dd2 [CELEBORN-1170] Upgrade snappy-java from 1.1.8.2 to 1.1.10.5
### What changes were proposed in this pull request?

### Why are the changes needed?
https://github.com/apache/incubator-celeborn/pull/2143

snappy-java 1.1.8.2 has the following CVE vulnerabilities:
https://scout.docker.com/vulnerabilities/id/CVE-2023-43642
https://scout.docker.com/vulnerabilities/id/CVE-2023-34455

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #2158 from cxzl25/CELEBORN-1170.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-14 22:28:32 +08:00
qinrui
04a1e90207 [CELEBORN-1122] Metrics supports json format
### What changes were proposed in this pull request?
Some users do not use Prometheus to collect monitoring metrics and rely on other systems instead. For them, metrics in JSON format are more user-friendly. This PR adds support for JSON-formatted metrics.
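
As a rough illustration of the difference between the two formats, the Python sketch below renders one sample both ways. The Prometheus line follows the standard text exposition format; the JSON field names and the metric name are hypothetical and are not taken from Celeborn's actual output.

```python
import json


def to_prometheus(name: str, labels: dict, value) -> str:
    """Render one sample in the Prometheus text exposition format."""
    label_text = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_text}}} {value}"


def to_json(name: str, labels: dict, value) -> str:
    """Render the same sample as a JSON object (hypothetical field names)."""
    return json.dumps({"name": name, "labels": labels, "value": value}, sort_keys=True)


if __name__ == "__main__":
    # "worker_count" and its label are made-up example values.
    print(to_prometheus("worker_count", {"role": "master"}, 3))
    print(to_json("worker_count", {"role": "master"}, 3))
```

A JSON payload like this can be consumed by any collector with a JSON parser, without needing a Prometheus-specific scraper.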

### Why are the changes needed?
Ditto.

### Does this PR introduce _any_ user-facing change?
Metrics support JSON format.

### How was this patch tested?
Cluster test.

Closes #2089 from suizhe007/CELEBORN-1122.

Authored-by: qinrui <qr7972@gmail.com>
Signed-off-by: Shuang <lvshuang.tb@gmail.com>
2023-12-06 09:24:28 +08:00
Mridul Muralidharan
3a41db360b
[CELEBORN-1006] Add support for Apache Hadoop 2.x in Celeborn build
Add support for Apache Hadoop 2.x in Celeborn build
Developers need to only specify their `hadoop.version`, and the build will pick the right profile internally based on the version to add the relevant dependencies.

[hadoop-client-api](https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client-api) and [hadoop-client-runtime](https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client-runtime) were introduced in hadoop 3.x, while hadoop 2.x had [hadoop-client](https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client)
Celeborn depends on the former, and so requires hadoop 3.x to build.

Apache Spark dropped support for Hadoop 2.x only in the recent v3.5 ([SPARK-42452](https://issues.apache.org/jira/browse/SPARK-42452)). Given this, there are cases where deployments on supported platforms, such as Spark 3.4 and older running on Hadoop 2.x, would need to pull in Hadoop 3.x just for Celeborn.

This PR uses `hadoop-client` when `hadoop.version` is specified as 2.x - and preserves existing behavior when `hadoop.version` is 3.x

Note - while using `hadoop-client` in 3.x is an option, hadoop community recommendation is to rely on `hadoop-client-api`/`hadoop-client-runtime`, hence making an effort to leverage that as much as possible.
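
The version-based selection described above amounts to the rule sketched below in Python. This is an illustration of the rule only, not the actual Maven profile logic; the function name is hypothetical, and the artifact names are the ones listed in this message.

```python
def hadoop_client_artifacts(hadoop_version: str) -> list:
    """Return the Hadoop client artifacts to depend on for a given hadoop.version."""
    major = int(hadoop_version.split(".")[0])
    if major == 2:
        # Hadoop 2.x only ships the monolithic client artifact.
        return ["hadoop-client"]
    if major >= 3:
        # Hadoop 3.x introduced the shaded api/runtime split.
        return ["hadoop-client-api", "hadoop-client-runtime"]
    raise ValueError(f"unsupported hadoop.version: {hadoop_version}")


if __name__ == "__main__":
    for version in ("2.10.0", "3.2.4", "3.3.0"):
        print(version, hadoop_client_artifacts(version))
```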

Adds support for using 2.x for hadoop.version

Three combinations were tested:

* Default, without overriding hadoop.version

Dependencies:
```
$ build/mvn dependency:list 2>&1 | grep hadoop | sort | uniq
[INFO]    org.apache.hadoop:hadoop-client-api:jar:3.2.4:compile
[INFO]    org.apache.hadoop:hadoop-client-runtime:jar:3.2.4:compile
```

Will update this section again based on test suite results (which are ongoing)

* Setting hadoop.version to newer 3.3.0 explicitly

Dependencies:
```
$ ARGS="-Pspark-3.1 -Dhadoop.version=3.3.0" ; build/mvn dependency:list $ARGS 2>&1 | grep hadoop | sort | uniq
[INFO]    org.apache.hadoop:hadoop-client-api:jar:3.3.0:compile
[INFO]    org.apache.hadoop:hadoop-client-runtime:jar:3.3.0:compile
```

* Setting hadoop.version to older 2.10.0

Dependencies:
```
$ ARGS="-Pspark-3.1 -Dhadoop.version=2.10.0" ; build/mvn dependency:list $ARGS 2>&1 | grep hadoop | grep compile | sort | uniq
[INFO]    org.apache.hadoop:hadoop-auth:jar:2.10.0:compile -- module hadoop.auth (auto)
[INFO]    org.apache.hadoop:hadoop-client:jar:2.10.0:compile -- module hadoop.client (auto)
[INFO]    org.apache.hadoop:hadoop-common:jar:2.10.0:compile -- module hadoop.common (auto)
[INFO]    org.apache.hadoop:hadoop-hdfs-client:jar:2.10.0:compile -- module hadoop.hdfs.client (auto)
[INFO]    org.apache.hadoop:hadoop-mapreduce-client-app:jar:2.10.0:compile -- module hadoop.mapreduce.client.app (auto)
[INFO]    org.apache.hadoop:hadoop-mapreduce-client-common:jar:2.10.0:compile -- module hadoop.mapreduce.client.common (auto)
[INFO]    org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.10.0:compile -- module hadoop.mapreduce.client.core (auto)
[INFO]    org.apache.hadoop:hadoop-mapreduce-client-jobclient:jar:2.10.0:compile
[INFO]    org.apache.hadoop:hadoop-mapreduce-client-shuffle:jar:2.10.0:compile -- module hadoop.mapreduce.client.shuffle (auto)
[INFO]    org.apache.hadoop:hadoop-yarn-api:jar:2.10.0:compile -- module hadoop.yarn.api (auto)
[INFO]    org.apache.hadoop:hadoop-yarn-common:jar:2.10.0:compile -- module hadoop.yarn.common (auto)
```

For each of the cases above, the build and tests pass with the corresponding `ARGS`.

Closes #1936 from mridulm/main.

Authored-by: Mridul Muralidharan <mridulatgmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-25 20:15:02 +08:00
mingji
e0c00ecd38 [CELEBORN-839][MR] Support Hadoop MapReduce
### What changes were proposed in this pull request?
1. Map side merge and push.
2. Support hadoop2 & 3.
3. Reduce in-memory merge.
4. Integrate LifecycleManager to RmApplicationMaster.

### Why are the changes needed?
Ditto.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
Cluster.

I tested this PR on a cluster of 4 nodes, each with 16 CPUs, 64 GB of memory, and 4 ESSDs.
Hadoop 2.8.5

1TB Terasort, 8400 mappers, 1000 reducers
Celeborn 81min vs MR shuffle 89min
![mr1](https://github.com/apache/incubator-celeborn/assets/4150993/a3cf6493-b6ff-4c03-9936-4558cf22761d)
![mr2](https://github.com/apache/incubator-celeborn/assets/4150993/9119ffb4-6996-4b77-bcdf-cbd6db5c096f)

1GB wordcount, 8 mappers, 8 reducers
Celeborn 35s vs MR shuffle 38s
![mr3](https://github.com/apache/incubator-celeborn/assets/4150993/907dce24-16b7-4788-ab5d-5b784fd07d47)
![mr4](https://github.com/apache/incubator-celeborn/assets/4150993/8e8065b9-6c46-4c8d-9e71-45eed8e63877)

Closes #1830 from FMX/CELEBORN-839.

Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-14 14:12:53 +08:00
Fu Chen
3fb896b11f [CELEBORN-666] Define protobuf-maven-plugin in the root pom.xml
### What changes were proposed in this pull request?

Define `protobuf-maven-plugin` in the root pom.xml

### Why are the changes needed?

to fix

```bash
build/mvn protobuf:compile -am -pl common
```

```
[ERROR] No plugin found for prefix 'protobuf' in the current project and in the plugin groups [org.apache.maven.plugins, org.codehaus.mojo] available from the repositories [local (/Users/fchen/.m2/repository), apache.snapshots (https://repository.apache.org/snapshots), central (https://repo.maven.apache.org/maven2)] -> [Help 1]
org.apache.maven.plugin.prefix.NoPluginFoundForPrefixException: No plugin found for prefix 'protobuf' in the current project and in the plugin groups [org.apache.maven.plugins, org.codehaus.mojo] available from the repositories [local (/Users/fchen/.m2/repository), apache.snapshots (https://repository.apache.org/snapshots), central (https://repo.maven.apache.org/maven2)]
    at org.apache.maven.plugin.prefix.internal.DefaultPluginPrefixResolver.resolve (DefaultPluginPrefixResolver.java:95)
    at org.apache.maven.lifecycle.internal.MojoDescriptorCreator.findPluginForPrefix (MojoDescriptorCreator.java:266)
    at org.apache.maven.lifecycle.internal.MojoDescriptorCreator.getMojoDescriptor (MojoDescriptorCreator.java:220)
    at org.apache.maven.lifecycle.internal.DefaultLifecycleTaskSegmentCalculator.calculateTaskSegments (DefaultLifecycleTaskSegmentCalculator.java:104)
    at org.apache.maven.lifecycle.internal.DefaultLifecycleTaskSegmentCalculator.calculateTaskSegments (DefaultLifecycleTaskSegmentCalculator.java:83)
    at org.apache.maven.lifecycle.internal.LifecycleStarter.execute (LifecycleStarter.java:89)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:298)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:192)
    at org.apache.maven.DefaultMaven.execute (DefaultMaven.java:105)
    at org.apache.maven.cli.MavenCli.execute (MavenCli.java:960)
    at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:293)
    at org.apache.maven.cli.MavenCli.main (MavenCli.java:196)
    at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke (Method.java:498)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced (Launcher.java:282)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launch (Launcher.java:225)
    at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode (Launcher.java:406)
    at org.codehaus.plexus.classworlds.launcher.Launcher.main (Launcher.java:347)
[ERROR]
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/NoPluginFoundForPrefixException
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

tested locally.

Closes #1579 from cfmcgrady/protobuf-plugin.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-12 19:46:46 +08:00
Fu Chen
ab449ffdd7
[CELEBORN-198] Fix the wrong configuration path of plugin protobuf-maven-plugin and … (#1146) 2023-01-05 20:09:31 +08:00
Ethan Feng
dd02070e4b
[CELEBORN-83] Fix various bugs when using HDFS as storage.
1. Fix incompatibility between Hadoop 2 and Hadoop 3.
2. Fix the HDFS writer never being called when there are no healthy disks.
3. Fix an NPE when the HDFS file writer closes.
2022-11-30 19:33:18 +08:00
Cheng Pan
96e969f46e
[BUILD] Extract project.version to Maven Property (#772) 2022-10-16 19:01:40 +08:00
Cheng Pan
ab16b4f101
[INFRA] Rename modules w/ celeborn prefix (#723) 2022-10-08 08:05:57 +08:00
Keyong Zhou
a2d2379153
[DOC] Replace RSS with Celeborn in docs (#715) 2022-10-06 10:37:46 +08:00
Cheng Pan
4880d78d6a
Extract spark tests and improve pom (#711) 2022-10-04 10:23:26 +08:00
Keyong Zhou
fe3b5988f2
[REFACTOR] Change package name to org.apache.celeborn (#710) 2022-10-02 18:10:29 +08:00
AngersZhuuuu
343caba83c
[ISSUE-656][FEATURE] Support get user quota from quota conf setting (#659)
[ISSUE-656][FEATURE] Support get user quota from rssConf setting
2022-09-29 12:55:01 +08:00
Ethan Feng
b4654d788c
[ISSUE-607]Add map ids info for each PartitionLocation to enable filtering for m… (#619) 2022-09-23 15:21:41 +08:00
Cheng Pan
4b42219595
Remove log4j1 (#501) 2022-09-05 19:30:15 +08:00
nafiy
6d308eb4f2
[ISSUE-465][Bug] Common module scalatest style unit test don't actually run (#472) 2022-08-28 18:52:39 +08:00
Ethan Feng
a4bab91453
[issue-332] support flush disk buffer to hdfs (#430) 2022-08-23 21:04:45 +08:00
Cheng Pan
f1f4b894af
Build: Enhance build system (#349) 2022-08-15 14:59:01 +08:00
Cheng Pan
d01ee81ee6
Bump Ratis 2.3.0 and related toolchains (#299) 2022-08-04 21:59:42 +08:00
AngersZhuuuu
fe17914942
Refactor pom import issue (#277) 2022-07-25 17:49:55 +08:00
mingji
d4d8eb3838 update pom version. 2022-06-24 14:28:42 +08:00
Ethan Feng
1113f437c6
[FEATURE] Remove dependency on spark-tags from common module (#126) (#128) 2022-05-31 15:24:08 +08:00
nafiy
491f89bbb5
[FEATURE]Add metrics source for JVM and CPU (#125)
* Add metrics source for JVM and CPU

* Fix scala style issue
2022-05-30 13:26:54 +08:00
Ethan Feng
86adc0d244
[Feature]Add metrics documentation and grafana dashboard. (#117) 2022-05-20 12:12:41 +08:00
Ethan Feng
baa2836216
Add metrics: (#85)
1. shuffle fetch send data time.
2. open stream time.
3. memory critical count.
2022-04-02 15:05:27 +08:00
Ethan Feng
9ad8254b0a
AQE support. (#67) 2022-04-01 20:19:01 +08:00
wangshengjie123
b2a6091b55
[Feature] Make log4j2 optional so that log4j2.xml can be updated to change the log level (#56) 2022-03-08 22:33:06 +08:00
Ethan Feng
356a1952e4
Multi Client Support (#47)
Co-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2022-01-29 22:28:06 +08:00
zky.zhoukeyong
ba5920acde Initial Commit for RSS 2021-12-28 20:57:35 +08:00