Commit Graph

59 Commits

Author SHA1 Message Date
SteNicholas
fa25ba8e1c
[CELEBORN-1366] Bump guava from 32.1.3-jre to 33.1.0-jre
### What changes were proposed in this pull request?

Bump guava from 32.1.3-jre to 33.1.0-jre.

### Why are the changes needed?

Guava v33.1.0 has been released; see the release notes for [v33.1.0](https://github.com/google/guava/releases/tag/v33.1.0). It brings some bug fixes and optimizations, including:

* cache: Fixed a bug that could cause https://github.com/google/guava/pull/6851#issuecomment-1931276822 for `CacheLoader`/`CacheBuilder`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2439 from SteNicholas/CELEBORN-1366.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-04-02 16:46:03 +08:00
Fei Wang
adbc77cd4f [CELEBORN-1317] Refine celeborn http server and support swagger ui
### What changes were proposed in this pull request?

Previously, there was no HTTP request spec covering things like query parameters, HTTP method, and response media type.
Each API also needed its own HttpEndpoint class.

In this PR, we refine the code for the HTTP service and provide Swagger UI.

Note: this PR does not change the original API request and response behavior, including the metrics APIs.

TODO:
1. define DTO
2. http request authentication

<img width="1900" alt="image" src="https://github.com/apache/incubator-celeborn/assets/6757692/7f8c2363-170d-4bdf-b2c9-74260e31d3e5">

<img width="1138" alt="image" src="https://github.com/apache/incubator-celeborn/assets/6757692/3ae6ec8e-00a8-475b-bb37-0329536185f6">

### Why are the changes needed?

To close CELEBORN-1317

### Does this PR introduce _any_ user-facing change?

The API is aligned with the previous behavior.

### How was this patch tested?
UT.

Closes #2371 from turboFei/jetty.

Authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-03-27 23:18:18 +08:00
zky.zhoukeyong
7af3126c7e Support Spark3.5 with JDK21
### What changes were proposed in this pull request?
Compile Spark-3.5 with
`./build/make-distribution.sh -Pspark-3.5 -Pjdk-21`
or
`./build/make-distribution.sh --sbt-enabled -Pspark-3.5 -Pjdk-21`

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
manual tests

Closes #2385 from waitinfuture/1327.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2024-03-27 18:42:16 +08:00
SteNicholas
c9b878a2f5
[INFRA] Remove incubator/incubating for graduation
### What changes were proposed in this pull request?

Remove incubator/incubating for graduation including:

- Remove `incubator`/`Incubating`.
- Remove `DISCLAIMER` and corresponding link.
- Update Release scripts and template.

Fix #2415.

### Why are the changes needed?

The ASF board has approved a resolution to graduate Celeborn into a full Top Level Project. To transition from the Apache Incubator to a new TLP, there are a few action items we need to complete.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2421 from SteNicholas/infra-graduation.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-03-27 13:54:47 +08:00
SteNicholas
73cf1562f7 [CELEBORN-1299] Introduce JVM profiling in Celeborn Worker using async-profiler
### What changes were proposed in this pull request?

Introduce JVM profiling `JVMProfiler` in Celeborn Worker using async-profiler to capture CPU and memory profiles.

### Why are the changes needed?

[async-profiler](https://github.com/async-profiler) is a sampling profiler for any JDK based on the HotSpot JVM that does not suffer from the safepoint bias problem. It has low overhead and doesn’t rely on JVMTI. It avoids the safepoint bias problem by using the `AsyncGetCallTrace` API provided by the HotSpot JVM to profile the Java code paths, and Linux’s perf_events to profile the native code paths. It features HotSpot-specific APIs to collect stack traces and to track memory allocations.
The feature introduces a profiler plugin that does not add any overhead unless enabled and can be configured to accept profiler arguments as a configuration parameter. It should support turning profiling on and off, and include the jar/binaries needed for profiling.

Backport [[SPARK-46094] Support Executor JVM Profiling](https://github.com/apache/spark/pull/44021).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Worker cluster test.

Closes #2409 from SteNicholas/CELEBORN-1299.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-03-25 14:05:50 +08:00
SteNicholas
adaa96fc60 [CELEBORN-1310][FLINK] Support Flink 1.19
### What changes were proposed in this pull request?

Support Flink 1.19.

### Why are the changes needed?

Flink 1.19.0 has been released: [Announcing the Release of Apache Flink 1.19](https://flink.apache.org/2024/03/18/announcing-the-release-of-apache-flink-1.19).

The main changes include:

- `org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel` constructor parameter changes:
   - `consumedSubpartitionIndex` changes to `consumedSubpartitionIndexSet`: [[FLINK-33743][runtime] Support consuming multiple subpartitions on a single channel](https://github.com/apache/flink/pull/23927).
   - adds `partitionRequestListenerTimeout`: [[FLINK-25055][network] Support listen and notify mechanism for partition request](https://github.com/apache/flink/pull/23565).
- `org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate` constructor removes parameters `subpartitionIndexRange`, `tieredStorageConsumerClient`, `nettyService` and `tieredStorageConsumerSpecs`: [[FLINK-33743][runtime] Support consuming multiple subpartitions on a single channel](https://github.com/apache/flink/pull/23927).
- Change the default config file to `config.yaml` in `flink-dist`: [[FLINK-33577][dist] Change the default config file to config.yaml in flink-dist](https://github.com/apache/flink/pull/24177).
- `org.apache.flink.configuration.RestartStrategyOptions` uses `org.apache.commons.compress.utils.Sets` of `commons-compress` dependency: [[FLINK-33865][runtime] Adding an ITCase to ensure exponential-delay.attempts-before-reset-backoff works well](https://github.com/apache/flink/pull/23942).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Local test:

- Flink batch job submission

```
$ ./bin/flink run examples/streaming/WordCount.jar --execution-mode BATCH
Executing example with default input data.
Use --input to specify file input.
Printing result to stdout. Use --output to specify output path.
Job has been submitted with JobID 2e9fb659991a9c29d376151783bdf6de
Program execution finished
Job with JobID 2e9fb659991a9c29d376151783bdf6de has finished.
Job Runtime: 1912 ms
```

- Flink batch job execution

![image](https://github.com/apache/incubator-celeborn/assets/10048174/18b60861-cafc-4df3-b94d-93307e728be2)

- Celeborn master log
```

24/03/18 20:52:47,513 INFO [celeborn-dispatcher-42] Master: Offer slots successfully for 1 reducers of 1710766312631-2e9fb659991a9c29d376151783bdf6de-0 on 1 workers.
```

- Celeborn worker log
```
24/03/18 20:52:47,704 INFO [celeborn-dispatcher-1] StorageManager: created file at /Users/nicholas/Software/Celeborn/apache-celeborn-0.5.0-SNAPSHOT/shuffle/celeborn-worker/shuffle_data/1710766312631-2e9fb659991a9c29d376151783bdf6de/0/0-0-0
24/03/18 20:52:47,707 INFO [celeborn-dispatcher-1] Controller: Reserved 1 primary location and 0 replica location for 1710766312631-2e9fb659991a9c29d376151783bdf6de-0
24/03/18 20:52:47,874 INFO [celeborn-dispatcher-2] Controller: Start commitFiles for 1710766312631-2e9fb659991a9c29d376151783bdf6de-0
24/03/18 20:52:47,890 INFO [worker-rpc-async-replier] Controller: CommitFiles for 1710766312631-2e9fb659991a9c29d376151783bdf6de-0 success with 1 committed primary partitions, 0 empty primary partitions, 0 failed primary partitions, 0 committed replica partitions, 0 empty replica partitions, 0 failed replica partitions.
```

Closes #2399 from SteNicholas/CELEBORN-1310.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-03-20 11:51:23 +08:00
SteNicholas
12c3779805 [CELEBORN-1330] Bump rocksdbjni version from 8.5.3 to 8.11.3
### What changes were proposed in this pull request?

Bump `rocksdbjni` version from 8.5.3 to 8.11.3.

### Why are the changes needed?

The new version brings some bug fixes:

- Fix a corner case with auto_readahead_size where Prev Operation returns NOT SUPPORTED error when scans direction is changed from forward to backward.
- Avoid destroying the periodic task scheduler's default timer in order to prevent static destruction order issues.
- Fix double counting of BYTES_WRITTEN ticker when doing writes with transactions.
- Fix a WRITE_STALL counter that was reporting wrong value in few cases.
- A lookup by MultiGet in a TieredCache that goes to the local flash cache and finishes with very low latency, i.e before the subsequent call to WaitAll, is ignored, resulting in a false negative and a memory leak.
- Fix bug in auto_readahead_size that combined with IndexType::kBinarySearchWithFirstKey + fails or iterator lands at a wrong key
- Fixed some cases in which DB file corruption was detected but ignored on creating a backup with BackupEngine.
- Fix bugs where rocksdb.blobdb.blob.file.synced includes blob files failed to get synced and rocksdb.blobdb.blob.file.bytes.written includes blob bytes failed to get written.
- Fixed a possible memory leak or crash on a failure (such as I/O error) in automatic atomic flush of multiple column families.
- Fixed some cases of in-memory data corruption using mmap reads with BackupEngine, sst_dump, or ldb.
- Fixed issues with experimental preclude_last_level_data_seconds option that could interfere with expected data tiering.
- Fixed the handling of the edge case when all existing blob files become unreferenced. Such files are now correctly deleted.

The full release notes are available at [rocksdbjni releases](https://github.com/facebook/rocksdb/releases).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #2389 from SteNicholas/CELEBORN-1330.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-03-14 18:01:03 +08:00
Fei Wang
1200e97b6c [BUILD] Bump netty version to latest 4.1.107.Final
### What changes were proposed in this pull request?
Update netty to latest version.

### Why are the changes needed?
[Netty 4.1.107.Final](https://netty.io/news/2024/02/13/4-1-107-Final.html) was released two weeks ago and seems to contain many useful changes.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing UT.

Closes #2328 from turboFei/netty_bump.

Authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: waitinfuture <zky.zhoukeyong@alibaba-inc.com>
2024-02-25 21:55:13 +08:00
Shuang
d89dcf0e06 [CELEBORN-1054] Support db based dynamic config service
### What changes were proposed in this pull request?

Support database based store backend implementation for dynamic configuration management

### Why are the changes needed?

Currently Celeborn provides the `FsConfigServiceImpl` implementation of the dynamic config service, which is based on the file system. We could also support a database-based store backend implementation.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

- `ConfigServiceSuiteJ#testDbConfig`

Closes #2273 from RexXiong/CELEBORN-1054.

Authored-by: Shuang <lvshuang.xjs@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-02-05 13:23:25 +08:00
tiny-dust
d315ff5055 [CELEBORN-1240] Introduce Husky Configuration to Celeborn Web
![image](https://github.com/apache/incubator-celeborn/assets/49502875/4404770c-c46e-470b-8f5e-c244c6656339)

### What changes were proposed in this pull request?

- Added Husky to enforce code quality with automated tasks during Git events.
- Added lint-staged for optimized linting on staged files before each commit.

### Why are the changes needed?

Enhances code quality.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Local test.

Closes #2250 from tiny-dust/CELEBORN-1240.

Lead-authored-by: tiny-dust <idioticzhou@foxmail.com>
Co-authored-by: 周顺顺 <idioticzhou@foxmail.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-01-26 16:23:42 +08:00
pengqli
a808c252ba
[CELEBORN-1184] Update the snakeyaml version from 1.33 to 2.2
### What changes were proposed in this pull request?
Update the snakeyaml version from 1.33 to 2.2 to reduce direct CVE vulnerabilities.

### Why are the changes needed?
The current snakeyaml version has the following CVE vulnerability; see
https://scout.docker.com/vulnerabilities/id/CVE-2022-1471

### Does this PR introduce _any_ user-facing change?
No user-facing change.

### How was this patch tested?
`./build/make-distribution.sh` to package and run tests locally.

Closes #2170 from dev-lpq/snakeyaml_version.

Authored-by: pengqli <pengqli@cisco.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-12-20 21:23:22 +08:00
pengqli
1037fbf921 [CELEBORN-1173] Upgrade netty version from 4.1.93.Final to 4.1.101.Final
### What changes were proposed in this pull request?
Upgrade netty-all from 4.1.93.Final to 4.1.101.Final to reduce direct CVE vulnerabilities.

### Why are the changes needed?
The current netty version has the following CVE vulnerabilities; see
https://scout.docker.com/vulnerabilities/id/CVE-2023-4586
https://scout.docker.com/vulnerabilities/id/CVE-2023-44487
https://scout.docker.com/vulnerabilities/id/GHSA-xpw8-rcwv-8f8p

### Does this PR introduce _any_ user-facing change?
No user-facing change.

### How was this patch tested?
`./build/make-distribution.sh` to package and run tests locally.

Closes #2150 from dev-lpq/update_netty_all_version.

Lead-authored-by: pengqli <pengqli@cisco.com>
Co-authored-by: Keyong Zhou <zhouky@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-16 14:03:37 +08:00
pengqli
0860553e18 [CELEBORN-1163] Upgrade protobuf from 3.19.2 to 3.21.7
### What changes were proposed in this pull request?
Upgrade protobuf from 3.19.2 to 3.21.7 to reduce direct CVE vulnerabilities.

### Why are the changes needed?

The current protobuf version has the following CVE vulnerabilities; see
https://scout.docker.com/vulnerabilities/id/CVE-2022-3510
https://scout.docker.com/vulnerabilities/id/CVE-2022-3509
https://scout.docker.com/vulnerabilities/id/CVE-2021-22570
https://scout.docker.com/vulnerabilities/id/CVE-2021-22569

### Does this PR introduce _any_ user-facing change?
No user-facing change.

### How was this patch tested?
`./build/make-distribution.sh` to package and run tests locally.

Closes #2142 from dev-lpq/upgrade_protobuf-java_version.

Authored-by: pengqli <pengqli@cisco.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-16 13:58:36 +08:00
sychen
2504b50dd2 [CELEBORN-1170] Upgrade snappy-java from 1.1.8.2 to 1.1.10.5
### What changes were proposed in this pull request?

### Why are the changes needed?
https://github.com/apache/incubator-celeborn/pull/2143

The snappy-java 1.1.8.2 version has the following CVE vulnerabilities; see
https://scout.docker.com/vulnerabilities/id/CVE-2023-43642
https://scout.docker.com/vulnerabilities/id/CVE-2023-34455

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #2158 from cxzl25/CELEBORN-1170.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-14 22:28:32 +08:00
pengqli
80458d18fa upgrade snappy-java from 1.1.8.2 to 1.1.10.5
### What changes were proposed in this pull request?
Upgrade snappy-java from 1.1.8.2 to 1.1.10.5 to reduce direct CVE vulnerabilities.

### Why are the changes needed?
The snappy-java 1.1.8.2 version has the following CVE vulnerabilities; see
https://scout.docker.com/vulnerabilities/id/CVE-2023-43642
https://scout.docker.com/vulnerabilities/id/CVE-2023-34455

### Does this PR introduce _any_ user-facing change?
No user-facing change.

### How was this patch tested?
`./build/make-distribution.sh` to package and run tests locally.

Closes #2143 from dev-lpq/update_snappy_java.

Authored-by: pengqli <pengqli@cisco.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-11 18:38:06 +08:00
qinrui
04a1e90207 [CELEBORN-1122] Metrics supports json format
### What changes were proposed in this pull request?
Some users do not use Prometheus to collect monitoring metrics and rely on other systems instead. For them, metrics in JSON format would be more user-friendly. This PR adds support for JSON-formatted metrics.

### Why are the changes needed?
Ditto.

### Does this PR introduce _any_ user-facing change?
Metrics support JSON format.

### How was this patch tested?
Cluster test.

Closes #2089 from suizhe007/CELEBORN-1122.

Authored-by: qinrui <qr7972@gmail.com>
Signed-off-by: Shuang <lvshuang.tb@gmail.com>
2023-12-06 09:24:28 +08:00
sychen
89b6cac5ab
[CELEBORN-1113] Bump Hadoop client version from 3.2.4 to 3.3.6
### What changes were proposed in this pull request?

### Why are the changes needed?

[[HADOOP-17098](https://issues.apache.org/jira/browse/HADOOP-17098)] Reduce Guava dependency in Hadoop source code

The newer Hadoop client version removes many Guava-related methods, which avoids some Guava conflicts.

`hadoop-client-api` 3.3.6
`hadoop-client-runtime` 3.3.6

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #2077 from cxzl25/CELEBORN-1113.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-12-01 15:41:04 +08:00
SteNicholas
4dfcd9b56b [CELEBORN-1092] Introduce JVM monitoring in Celeborn Worker using JVMQuake
### What changes were proposed in this pull request?

Introduce JVM monitoring in Celeborn Worker using JVMQuake to enable early detection of memory management issues and facilitate fast failure.

### Why are the changes needed?

When memory management gets out of control in a Celeborn Worker, we typically use jvmkill as a remedy: it kills the process and generates a heap dump for post-analysis. However, even with jvmkill protection, we may still encounter issues caused by the JVM running out of memory, such as repeated Full GCs that perform no useful work during the pause time. Since the JVM does not exhaust 100% of resources, jvmkill will not be triggered. Therefore JVMQuake is introduced to provide more granular monitoring of GC behavior, enabling early detection of memory management issues and facilitating fast failure. It follows the principle of [jvmquake](https://github.com/Netflix-Skunkworks/jvmquake), a JVMTI agent that attaches to your JVM and automatically signals and kills it when the program has become unstable.
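The GC-monitoring idea can be illustrated with a rough sketch of a GC "debt" counter in the spirit of jvmquake. This is not Celeborn's `JVMQuake` implementation: the class name `GcDebtMonitor`, the threshold, and the runtime weight are illustrative assumptions. Time spent in GC increases the debt, time spent running application code pays it down, and the process is considered unhealthy once the debt exceeds the threshold.

```python
class GcDebtMonitor:
    """Hypothetical GC debt tracker; all names and values are assumptions."""

    def __init__(self, kill_threshold_s: float, runtime_weight: float = 1.0) -> None:
        self.kill_threshold_s = kill_threshold_s
        self.runtime_weight = runtime_weight
        self.debt_s = 0.0

    def record(self, gc_time_s: float, run_time_s: float) -> bool:
        """Update the debt for one interval; return True if the threshold is exceeded."""
        # GC pauses add to the debt.
        self.debt_s += gc_time_s
        # Useful run time pays the debt down, but never below zero.
        self.debt_s = max(0.0, self.debt_s - run_time_s * self.runtime_weight)
        return self.debt_s > self.kill_threshold_s


monitor = GcDebtMonitor(kill_threshold_s=5.0)
healthy = monitor.record(gc_time_s=1.0, run_time_s=9.0)   # mostly useful work
thrash_1 = monitor.record(gc_time_s=4.0, run_time_s=1.0)  # GC dominates
thrash_2 = monitor.record(gc_time_s=4.0, run_time_s=1.0)  # GC still dominates
print(healthy, thrash_1, thrash_2)
```

A counter like this reacts to sustained GC thrashing even when the heap is never fully exhausted, which is the case jvmkill alone does not cover.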

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`JVMQuakeSuite`

Closes #2061 from SteNicholas/CELEBORN-1092.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
2023-11-28 20:45:08 +08:00
Fu Chen
aab073ab16
[CELEBORN-1125] Bump guava from 14.0.1 to 32.1.3-jre
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

- bump guava from 14.0.1 to 32.1.3-jre
- refer to https://github.com/apache/spark/pull/26911, remove usages of Guava that no longer work in Guava 27/32, and replace them with workalikes. After this PR, Celeborn no longer relies on a specific version of Guava and is compatible with Guava 14/27/32. We can also pin Guava to 27 when running MapReduce integration tests.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

Closes #2090 from cfmcgrady/guava-27.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-11-21 16:18:14 +08:00
sychen
efa22a4936 [CELEBORN-1105][FLINK] Support Flink 1.18
### What changes were proposed in this pull request?

### Why are the changes needed?

```bash
flink-1.18.0
./bin/start-cluster.sh
./bin/flink run examples/streaming/WordCount.jar --execution-mode BATCH
```

```java
Caused by: java.lang.NoSuchMethodError: org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.<init>(Ljava/lang/String;ILorg/apache/flink/runtime/jobgraph/IntermediateDataSetID;Lorg/apache/flink/runtime/io/network/partition/ResultPartitionType;Lorg/apache/flink/runtime/executiongraph/IndexRange;ILorg/apache/flink/runtime/io/network/partition/PartitionProducerStateProvider;Lorg/apache/flink/util/function/SupplierWithException;Lorg/apache/flink/runtime/io/network/buffer/BufferDecompressor;Lorg/apache/flink/core/memory/MemorySegmentProvider;ILorg/apache/flink/runtime/throughput/ThroughputCalculator;Lorg/apache/flink/runtime/throughput/BufferDebloater;)V
	at org.apache.celeborn.plugin.flink.RemoteShuffleInputGate$FakedRemoteInputChannel.<init>(RemoteShuffleInputGate.java:225)
	at org.apache.celeborn.plugin.flink.RemoteShuffleInputGate.getChannel(RemoteShuffleInputGate.java:179)
	at org.apache.flink.runtime.io.network.partition.consumer.InputGate.setChannelStateWriter(InputGate.java:90)
	at org.apache.flink.runtime.taskmanager.InputGateWithMetrics.setChannelStateWriter(InputGateWithMetrics.java:120)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.injectChannelStateWriterIntoChannels(StreamTask.java:524)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.<init>(StreamTask.java:496)
```

Flink 1.18.0 release
https://flink.apache.org/2023/10/24/announcing-the-release-of-apache-flink-1.18/

Interface `org.apache.flink.runtime.io.network.buffer.Buffer` adds `setRecycler` method.
[[FLINK-32549](https://issues.apache.org/jira/browse/FLINK-32549)][network] Tiered storage memory manager supports ownership transfer for buffers

`org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate` constructor adds parameters.
[[FLINK-31638](https://issues.apache.org/jira/browse/FLINK-31638)][network] Introduce the TieredStorageConsumerClient to SingleInputGate
[[FLINK-31642](https://issues.apache.org/jira/browse/FLINK-31642)][network] Introduce the MemoryTierConsumerAgent to TieredStorageConsumerClient

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
```bash
flink-1.18.0 ./bin/flink run examples/streaming/WordCount.jar --execution-mode BATCH
Executing example with default input data.
Use --input to specify file input.
Printing result to stdout. Use --output to specify output path.
Job has been submitted with JobID d7fc5f0ca018a54e9453c4d35f7c598a
Program execution finished
Job with JobID d7fc5f0ca018a54e9453c4d35f7c598a has finished.
Job Runtime: 1635 ms
```

<img width="1297" alt="image" src="https://github.com/apache/incubator-celeborn/assets/3898450/6a5266bf-2386-4386-b98b-a60d2570fa99">

Closes #2063 from cxzl25/CELEBORN-1105.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Shuang <lvshuang.tb@gmail.com>
2023-11-06 15:53:39 +08:00
sychen
6fa669748c [CELEBORN-999] MR deps check
### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
```
./dev/dependencies.sh  --module mr --check
./dev/dependencies.sh  --module mr --check --sbt
```

Closes #1928 from cxzl25/CELEBORN-999.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
2023-10-11 13:56:31 +08:00
sychen
beed2a85b0
[CELEBORN-977] Support RocksDB as recover DB backend
### What changes were proposed in this pull request?

### Why are the changes needed?

LevelDB does not support macOS on ARM.

```java
java.lang.UnsatisfiedLinkError: Could not load library. Reasons: [no leveldbjni64-1.8 in java.library.path, no leveldbjni-1.8 in java.library.path, no leveldbjni in java.library.path, /private/var/folders/tc/r2n_8g6j4731h7clfqwntg880000gn/T/libleveldbjni-64-1-4616234670453989010.8: dlopen(/private/var/folders/tc/r2n_8g6j4731h7clfqwntg880000gn/T/libleveldbjni-64-1-4616234670453989010.8, 0x0001): tried: '/private/var/folders/tc/r2n_8g6j4731h7clfqwntg880000gn/T/libleveldbjni-64-1-4616234670453989010.8' (fat file, but missing compatible architecture (have 'x86_64,i386', need 'arm64')), '/System/Volumes/Preboot/Cryptexes/OS/private/var/folders/tc/r2n_8g6j4731h7clfqwntg880000gn/T/libleveldbjni-64-1-4616234670453989010.8' (no such file), '/private/var/folders/tc/r2n_8g6j4731h7clfqwntg880000gn/T/libleveldbjni-64-1-4616234670453989010.8' (fat file, but missing compatible architecture (have 'x86_64,i386', need 'arm64'))]
  	at org.fusesource.hawtjni.runtime.Library.doLoad(Library.java:182)
  	at org.fusesource.hawtjni.runtime.Library.load(Library.java:140)
  	at org.fusesource.leveldbjni.JniDBFactory.<clinit>(JniDBFactory.java:48)
  	at org.apache.celeborn.service.deploy.worker.shuffledb.LevelDBProvider.initLevelDB(LevelDBProvider.java:49)
  	at org.apache.celeborn.service.deploy.worker.shuffledb.DBProvider.initDB(DBProvider.java:30)
  	at org.apache.celeborn.service.deploy.worker.storage.StorageManager.<init>(StorageManager.scala:197)
  	at org.apache.celeborn.service.deploy.worker.Worker.<init>(Worker.scala:109)
  	at org.apache.celeborn.service.deploy.worker.Worker$.main(Worker.scala:734)
  	at org.apache.celeborn.service.deploy.worker.Worker.main(Worker.scala)
```

The released `leveldbjni-all` for `org.fusesource.leveldbjni` does not support AArch64 Linux either; there we would need to use `org.openlabtesting.leveldbjni`.

See https://issues.apache.org/jira/browse/HADOOP-16614

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
local test

Closes #1913 from cxzl25/CELEBORN-977.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-09-19 09:20:33 +08:00
sychen
045c682c89 [CELEBORN-978] Improve dependency.sh replacement mode
### What changes were proposed in this pull request?

### Why are the changes needed?
When executing the update script locally, it may generate a log line like the following, which causes awk to exit with an error.
```
Downloading from nexus: httpxxxx
```

```bash
./dev/dependencies.sh --replace
```

```
awk: trying to access out of range field -1
 input record number 1, file
 source line number 2
```

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1914 from cxzl25/CELEBORN-978.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-16 09:35:13 +08:00
mingji
e0c00ecd38 [CELEBORN-839][MR] Support Hadoop MapReduce
### What changes were proposed in this pull request?
1. Map side merge and push.
2. Support hadoop2 & 3.
3. Reduce in-memory merge.
4. Integrate LifecycleManager to RmApplicationMaster.

### Why are the changes needed?
Ditto.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
Cluster.

I tested this PR on a cluster of 4 nodes, each with 16 CPUs, 64 GB of memory, and 4 ESSDs.
Hadoop 2.8.5

1TB Terasort, 8400 mappers, 1000 reducers
Celeborn 81min vs MR shuffle 89min
![mr1](https://github.com/apache/incubator-celeborn/assets/4150993/a3cf6493-b6ff-4c03-9936-4558cf22761d)
![mr2](https://github.com/apache/incubator-celeborn/assets/4150993/9119ffb4-6996-4b77-bcdf-cbd6db5c096f)

1GB wordcount, 8 mappers, 8 reducers
Celeborn 35s VS MR shuffle 38s
![mr3](https://github.com/apache/incubator-celeborn/assets/4150993/907dce24-16b7-4788-ab5d-5b784fd07d47)
![mr4](https://github.com/apache/incubator-celeborn/assets/4150993/8e8065b9-6c46-4c8d-9e71-45eed8e63877)

Closes #1830 from FMX/CELEBORN-839.

Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-14 14:12:53 +08:00
Fu Chen
142d12caa5 [CELEBORN-929][INFRA] Add dependencies check CI
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

As title

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

Closes #1852 from cfmcgrady/audit-deps-ci.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
2023-09-07 14:02:07 +08:00
Kent Yao
28449630f3 [CELEBORN-937][INFRA] Improve branch suggestion for backporting
### What changes were proposed in this pull request?

This PR automatically iterates to the next branch to be merged instead of always suggesting the latest one.
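A minimal sketch of the idea, not the merge script's actual code: given the release branches ordered newest first and the branches already picked in this run, suggest the first one that has not been used yet. The function name `suggest_next_branch` and the branch names are hypothetical.

```python
from typing import Iterable, Optional, Sequence


def suggest_next_branch(branches: Sequence[str], merged: Iterable[str]) -> Optional[str]:
    """Return the first branch in `branches` that has not been merged into yet.

    `branches` is assumed to be ordered newest first; None means nothing is left.
    """
    done = set(merged)
    for name in branches:
        if name not in done:
            return name
    return None


# Hypothetical release branches, newest first.
release_branches = ["branch-0.4", "branch-0.3", "branch-0.2"]
print(suggest_next_branch(release_branches, []))
print(suggest_next_branch(release_branches, ["branch-0.4"]))
```

Tracking what has already been picked means that pressing Enter on the default prompt walks down the branch list instead of offering the same branch twice.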

### Why are the changes needed?

To prevent misoperation.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

manually

Closes #1870 from yaooqinn/CELEBORN-937.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-01 00:20:42 +08:00
Kent Yao
ba4f1bb2fe
[CELEBORN-931][INFRA] Fix merged pull requests resolution
### What changes were proposed in this pull request?

This PR fixes the resolution for merged pull requests. It appears that the user "asfgit" is no longer closing pull requests, but rather the committers are.

### Why are the changes needed?

Bugfix: make the merge script re-runnable if you accidentally abort a cherry-pick or change your mind later about backporting.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

tested locally

Closes #1862 from yaooqinn/CELEBORN-931.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-08-30 09:51:34 +08:00
Kent Yao
7e373feea7
[CELEBORN-930][INFRA][FOLLOWUP] Fix environment variable naming
### What changes were proposed in this pull request?

Replace JIRA_USERNAME and JIRA_PASSWORD with ASF_*

### Why are the changes needed?

hotfix

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

manually

Closes #1861 from yaooqinn/CELEBORN-930_F.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-08-29 23:33:04 +08:00
Kent Yao
df8b56a7c7 [CELEBORN-930][INFRA] Eagerly check if the token is valid to align with the behavior of username/password auth
### What changes were proposed in this pull request?

Previously, we allowed for token authentication when resolving Jira issues in pull request merging. However, token auth is lazy during the initial handshake, so an invalid token is not detected up front, which might confuse maintainers.

This pull request promptly calls the current_user() function to initiate authentication and provides clear instructions for token expiration.
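The eager check can be sketched as follows. This is an illustration only, not the merge script's code: the two client classes are hypothetical stand-ins for a Jira client, and only the `current_user()` call comes from the description above.

```python
class ExpiredTokenClient:
    """Hypothetical stand-in for a Jira client whose token has expired."""

    def current_user(self) -> str:
        raise PermissionError("401 Unauthorized")


class ValidTokenClient:
    """Hypothetical stand-in for a Jira client with a working token."""

    def current_user(self) -> str:
        return "alice"


def check_auth_eagerly(client) -> str:
    """Force authentication now so a bad token fails before any merge work."""
    try:
        return client.current_user()
    except Exception as exc:
        # Turn the low-level failure into an actionable message.
        raise RuntimeError(
            "Jira authentication failed; the access token may be expired. "
            "Generate a new token and update the environment variable."
        ) from exc


print(check_auth_eagerly(ValidTokenClient()))
```

Failing at startup with a clear message is cheaper than discovering the expired token after the pull request has already been merged and the Jira issue still needs resolving.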

see also 8523ee5d90

### Why are the changes needed?

make it easy for maintainers to update their expired Jira tokens.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

a maintainer can test this with invalid tokens

Closes #1857 from yaooqinn/CELEBORN-930.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-29 21:33:11 +08:00
Kent Yao
2b657c5243 [CELEBORN-918][INFRA] Auto Assign First-time contributor with Contributors role
### What changes were proposed in this pull request?

As an incubating project, we routinely welcome first-time contributors. This PR adds automation that grants them the Contributors role so that issues can be assigned to them.

### Why are the changes needed?

GitHub - JIRA integration

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

tested at apache/spark project, and

```python
>>> asf_jira.project_roles("CELEBORN")
{'Developers': {'id': '10050', 'url': 'https://issues.apache.org/jira/rest/api/2/project/12324920/role/10050'}, 'Contributors': {'id': '10010', 'url': 'https://issues.apache.org/jira/rest/api/2/project/12324920/role/10010'}, 'PMC': {'id': '10011', 'url': 'https://issues.apache.org/jira/rest/api/2/project/12324920/role/10011'}, 'Committers': {'id': '10001', 'url': 'https://issues.apache.org/jira/rest/api/2/project/12324920/role/10001'}, 'Administrators': {'id': '10002', 'url': 'https://issues.apache.org/jira/rest/api/2/project/12324920/role/10002'}, 'ASF Members': {'id': '10150', 'url': 'https://issues.apache.org/jira/rest/api/2/project/12324920/role/10150'}, 'Users': {'id': '10040', 'url': 'https://issues.apache.org/jira/rest/api/2/project/12324920/role/10040'}, 'Contributors 1': {'id': '10350', 'url': 'https://issues.apache.org/jira/rest/api/2/project/12324920/role/10350'}}

```

Closes #1839 from yaooqinn/CELEBORN-918.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-26 16:50:31 +08:00
Fu Chen
49b6b10d5e [CELEBORN-879] Add dev/dependencies.sh for audit dependencies
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

As title

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

Closes #1797 from cfmcgrady/audit-deps.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-26 15:59:20 +08:00
Kent Yao
77abb31a5b
[CELEBORN-910][INFRA] Support JIRA_ACCESS_TOKEN for merging script
### What changes were proposed in this pull request?

This PR supports JIRA_ACCESS_TOKEN for merge script to enable token auth

c36d54a569
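
A minimal sketch of how a merge script might choose between token and username/password auth. `JIRA_ACCESS_TOKEN`, `JIRA_USERNAME`, and `JIRA_PASSWORD` are the variable names that appear in this log; the helper name `jira_auth_kwargs` is hypothetical, and the returned keys assume the `token_auth` and `basic_auth` keyword arguments of the `jira` Python package's `JIRA` client.

```python
import os


def jira_auth_kwargs(env=None):
    """Return keyword arguments for a Jira client, preferring token auth.

    Hypothetical helper for illustration; not the actual merge script code.
    """
    env = os.environ if env is None else env
    token = env.get("JIRA_ACCESS_TOKEN")
    if token:
        # Token auth takes precedence when a token is configured.
        return {"token_auth": token}
    # Otherwise fall back to username/password auth.
    return {
        "basic_auth": (env.get("JIRA_USERNAME", ""), env.get("JIRA_PASSWORD", ""))
    }
```

The result could then be passed along as `JIRA(server_url, **jira_auth_kwargs())`, where `server_url` is whatever Jira endpoint the script targets.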

### Why are the changes needed?

Tokens are more secure and easily revoked or expired.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

Your Jira admins can create a token for verification.

Closes #1837 from yaooqinn/CELEBORN-910.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-08-24 20:02:44 +08:00
Kent Yao
1550f92086 [CELEBORN-907][INFRA] The Jira Python misses our assignee when it searches users again
…

### What changes were proposed in this pull request?

A detailed description can be found in 8fb799d47b

### Why are the changes needed?

bypass upstream bug

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

I guess pan3793 already hit this issue on the Jira side when resolving CELEBORN-903

Closes #1832 from yaooqinn/CELEBORN-907.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-24 11:54:52 +08:00
Kent Yao
ad890e9381
[CELEBORN-903][INFRA] Fix list index out of range for JIRA resolution in merge_pr
### What changes were proposed in this pull request?

This PR fixes list index out-of-range error for the merge_pr script

The error occurs when the branch we merge into does not have a jira project version.

see also cb16591f9b
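
The failure mode can be illustrated with a small guard. This is a sketch under assumptions, not the actual fix: `default_fix_version` is a hypothetical helper, and `branch_versions` stands for the list of Jira version names that match the branch being merged into.

```python
def default_fix_version(branch_versions):
    """Pick a default fix version, or None when the branch has no Jira version.

    Hypothetical helper for illustration only.
    """
    # Indexing branch_versions[0] directly raises IndexError on an empty list,
    # which is the "list index out of range" error described above.
    return branch_versions[0] if branch_versions else None
```

Returning `None` lets the caller prompt for a fix version manually instead of crashing.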

### Why are the changes needed?

Bugfix

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

Verification is to be done by a maintainer: check out this PR and use the updated script to merge and test.

Closes #1827 from yaooqinn/CELEBORN-903.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-08-23 18:49:55 +08:00
Cheng Pan
007b716b64
[CELEBORN-633][INFRA] Introduce PR merge script
### What changes were proposed in this pull request?

Introduce PR merge script `dev/merge_pr.py`, which is borrowed from Apache Spark

### Why are the changes needed?

This script simplifies the PR merge procedure

- auto backport to release branches
- auto close the JIRA ticket
- auto fill in the JIRA fixed version
- preserve the PR description in git log
- preserve the author and committer in git log

### Does this PR introduce _any_ user-facing change?

No, it's for committers.

### How was this patch tested?

a1de16a80f was merged by this tool

Closes #1539 from pan3793/CELEBORN-633.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-02 19:52:04 +08:00
Ethan Feng
114b1b4d62
[CELEBORN-548][FLINK] Support flink 1.17. (#1472) 2023-05-05 23:00:49 +08:00
Ethan Feng
93d2f106e0
[CELEBORN-548][FLINK] Support flink 1.15. (#1463) 2023-05-04 15:23:59 +08:00
Cheng Pan
a16ba0e807
[CELEBORN-180][BUILD] Script for creating binary release artifact (#1129) 2023-01-03 12:58:42 +08:00
Cheng Pan
7105f98829
[CELEBORN-160][BUILD] Split CI workflow (#1107) 2022-12-21 23:47:01 +08:00
Cheng Pan
dc66369973
[CELEBORN-150][BUILD] Reduce binary tarball size by sharing jars (#1095)
* [CELEBORN-150][BUILD] Reduce binary tarball size by sharing jars

* nit

* nit

* docker

* nit

* cp -R
2022-12-16 14:30:17 +08:00
Shuang
f3f104870c
[CELEBORN-75] Initialize flink plugin module (#1027) 2022-12-07 15:53:00 +08:00
Cheng Pan
df7cb8550b
[INFRA] Introduce checkout_pr.sh shell script (#968) 2022-11-14 22:28:43 +08:00
Binjie Yang
f51fae6c75
[REFACTOR] Replace the missing Remote Shuffle Service (#885) 2022-10-28 17:37:59 +08:00
Cheng Pan
65614edfbb
[BUILD] Create shaded module for Spark client (#878) 2022-10-27 22:11:54 +08:00
Cheng Pan
873eeeb1ed
[BUILD] Add apache- prefix in release tarball name (#854) 2022-10-25 22:39:48 +08:00
Cheng Pan
ab16b4f101
[INFRA] Rename modules w/ celeborn prefix (#723) 2022-10-08 08:05:57 +08:00
Cheng Pan
29210fe9b7
[BUILD] Build in serial mode (#545) 2022-09-05 20:05:33 +08:00
Cheng Pan
82566148d8
Use different artifact name for shuffle manager 2/3 (#541) 2022-09-05 19:47:24 +08:00
Cheng Pan
c88ce306be
Use Spotless to auto check and reformat Java/Scala code (#497) 2022-09-01 21:19:56 +08:00
Cheng Pan
3dddb65f31
Enable Apache Rat and fix license header (#492) 2022-08-31 23:53:33 +08:00