42 Commits

Author SHA1 Message Date
SteNicholas
4dfcd9b56b [CELEBORN-1092] Introduce JVM monitoring in Celeborn Worker using JVMQuake
### What changes were proposed in this pull request?

Introduce JVM monitoring in Celeborn Worker using JVMQuake to enable early detection of memory management issues and facilitate fast failure.

### Why are the changes needed?

When memory management in the Celeborn Worker gets out of control, we typically use jvmkill as a remedy: it kills the process and generates a heap dump for post-mortem analysis. However, even with jvmkill protection, we may still hit problems caused by the JVM running out of memory, such as repeated Full GCs that do no useful work during the pause time. Because the JVM has not exhausted 100% of its resources, jvmkill is not triggered. JVMQuake is therefore introduced to monitor GC behavior at a finer granularity, enabling early detection of memory management issues and facilitating fast failure. It follows the principle of [jvmquake](https://github.com/Netflix-Skunkworks/jvmquake), a JVMTI agent that attaches to your JVM and automatically signals and kills it when the program has become unstable.
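
The idea of monitoring GC behavior rather than raw memory exhaustion can be illustrated with a small deficit counter. This is a minimal sketch in Python, not the Celeborn implementation: the class name, the 30-second threshold, and the run-time weight are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class GcDeficitMonitor:
    """Toy model of a GC 'deficit' counter; all names and defaults are hypothetical."""

    kill_threshold_s: float = 30.0  # assumed: deficit at which the process counts as unhealthy
    runtime_weight: float = 1.0     # assumed: how quickly useful run time pays the deficit back
    deficit_s: float = 0.0

    def record(self, gc_pause_s: float, run_time_s: float) -> bool:
        """Account for one GC cycle; return True once the threshold is exceeded."""
        # Time paused in GC adds to the deficit; time spent running pays it back.
        self.deficit_s += gc_pause_s
        self.deficit_s -= run_time_s * self.runtime_weight
        # Clamp at zero so a long healthy period cannot bank unlimited credit.
        self.deficit_s = max(self.deficit_s, 0.0)
        return self.deficit_s > self.kill_threshold_s


monitor = GcDeficitMonitor()
# Healthy cycle: a short pause followed by a long stretch of useful work.
healthy = monitor.record(gc_pause_s=0.1, run_time_s=10.0)
# Thrashing: long pauses with almost no useful work in between.
thrashing = [monitor.record(gc_pause_s=2.0, run_time_s=0.1) for _ in range(16)]
```

In this model a process stuck in back-to-back Full GCs crosses the threshold after a bounded number of cycles, even though it never reaches the complete exhaustion that jvmkill waits for.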

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`JVMQuakeSuite`

Closes #2061 from SteNicholas/CELEBORN-1092.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
2023-11-28 20:45:08 +08:00
Fu Chen
aab073ab16
[CELEBORN-1125] Bump guava from 14.0.1 to 32.1.3-jre
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

- bump guava from 14.0.1 to 32.1.3-jre
- Following https://github.com/apache/spark/pull/26911, remove usages of Guava that no longer work in Guava 27/32 and replace them with workalikes. After this PR, Celeborn no longer relies on a specific version of Guava and is compatible with Guava 14/27/32. We can also pin Guava to 27 when running the MapReduce integration tests.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

Closes #2090 from cfmcgrady/guava-27.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-11-21 16:18:14 +08:00
sychen
efa22a4936 [CELEBORN-1105][FLINK] Support Flink 1.18
### What changes were proposed in this pull request?

### Why are the changes needed?

```bash
flink-1.18.0
./bin/start-cluster.sh
./bin/flink run examples/streaming/WordCount.jar --execution-mode BATCH
```

```java
Caused by: java.lang.NoSuchMethodError: org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.<init>(Ljava/lang/String;ILorg/apache/flink/runtime/jobgraph/IntermediateDataSetID;Lorg/apache/flink/runtime/io/network/partition/ResultPartitionType;Lorg/apache/flink/runtime/executiongraph/IndexRange;ILorg/apache/flink/runtime/io/network/partition/PartitionProducerStateProvider;Lorg/apache/flink/util/function/SupplierWithException;Lorg/apache/flink/runtime/io/network/buffer/BufferDecompressor;Lorg/apache/flink/core/memory/MemorySegmentProvider;ILorg/apache/flink/runtime/throughput/ThroughputCalculator;Lorg/apache/flink/runtime/throughput/BufferDebloater;)V
	at org.apache.celeborn.plugin.flink.RemoteShuffleInputGate$FakedRemoteInputChannel.<init>(RemoteShuffleInputGate.java:225)
	at org.apache.celeborn.plugin.flink.RemoteShuffleInputGate.getChannel(RemoteShuffleInputGate.java:179)
	at org.apache.flink.runtime.io.network.partition.consumer.InputGate.setChannelStateWriter(InputGate.java:90)
	at org.apache.flink.runtime.taskmanager.InputGateWithMetrics.setChannelStateWriter(InputGateWithMetrics.java:120)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.injectChannelStateWriterIntoChannels(StreamTask.java:524)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.<init>(StreamTask.java:496)
```

Flink 1.18.0 release
https://flink.apache.org/2023/10/24/announcing-the-release-of-apache-flink-1.18/

Interface `org.apache.flink.runtime.io.network.buffer.Buffer` adds `setRecycler` method.
[[FLINK-32549](https://issues.apache.org/jira/browse/FLINK-32549)][network] Tiered storage memory manager supports ownership transfer for buffers

`org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate` constructor adds parameters.
[[FLINK-31638](https://issues.apache.org/jira/browse/FLINK-31638)][network] Introduce the TieredStorageConsumerClient to SingleInputGate
[[FLINK-31642](https://issues.apache.org/jira/browse/FLINK-31642)][network] Introduce the MemoryTierConsumerAgent to TieredStorageConsumerClient

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
```bash
flink-1.18.0 ./bin/flink run examples/streaming/WordCount.jar --execution-mode BATCH
Executing example with default input data.
Use --input to specify file input.
Printing result to stdout. Use --output to specify output path.
Job has been submitted with JobID d7fc5f0ca018a54e9453c4d35f7c598a
Program execution finished
Job with JobID d7fc5f0ca018a54e9453c4d35f7c598a has finished.
Job Runtime: 1635 ms
```

<img width="1297" alt="image" src="https://github.com/apache/incubator-celeborn/assets/3898450/6a5266bf-2386-4386-b98b-a60d2570fa99">

Closes #2063 from cxzl25/CELEBORN-1105.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Shuang <lvshuang.tb@gmail.com>
2023-11-06 15:53:39 +08:00
sychen
6fa669748c [CELEBORN-999] MR deps check
### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
```
./dev/dependencies.sh  --module mr --check
./dev/dependencies.sh  --module mr --check --sbt
```

Closes #1928 from cxzl25/CELEBORN-999.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
2023-10-11 13:56:31 +08:00
sychen
beed2a85b0
[CELEBORN-977] Support RocksDB as recover DB backend
### What changes were proposed in this pull request?

### Why are the changes needed?

LevelDB does not support macOS on ARM.

```java
java.lang.UnsatisfiedLinkError: Could not load library. Reasons: [no leveldbjni64-1.8 in java.library.path, no leveldbjni-1.8 in java.library.path, no leveldbjni in java.library.path, /private/var/folders/tc/r2n_8g6j4731h7clfqwntg880000gn/T/libleveldbjni-64-1-4616234670453989010.8: dlopen(/private/var/folders/tc/r2n_8g6j4731h7clfqwntg880000gn/T/libleveldbjni-64-1-4616234670453989010.8, 0x0001): tried: '/private/var/folders/tc/r2n_8g6j4731h7clfqwntg880000gn/T/libleveldbjni-64-1-4616234670453989010.8' (fat file, but missing compatible architecture (have 'x86_64,i386', need 'arm64')), '/System/Volumes/Preboot/Cryptexes/OS/private/var/folders/tc/r2n_8g6j4731h7clfqwntg880000gn/T/libleveldbjni-64-1-4616234670453989010.8' (no such file), '/private/var/folders/tc/r2n_8g6j4731h7clfqwntg880000gn/T/libleveldbjni-64-1-4616234670453989010.8' (fat file, but missing compatible architecture (have 'x86_64,i386', need 'arm64'))]
  	at org.fusesource.hawtjni.runtime.Library.doLoad(Library.java:182)
  	at org.fusesource.hawtjni.runtime.Library.load(Library.java:140)
  	at org.fusesource.leveldbjni.JniDBFactory.<clinit>(JniDBFactory.java:48)
  	at org.apache.celeborn.service.deploy.worker.shuffledb.LevelDBProvider.initLevelDB(LevelDBProvider.java:49)
  	at org.apache.celeborn.service.deploy.worker.shuffledb.DBProvider.initDB(DBProvider.java:30)
  	at org.apache.celeborn.service.deploy.worker.storage.StorageManager.<init>(StorageManager.scala:197)
  	at org.apache.celeborn.service.deploy.worker.Worker.<init>(Worker.scala:109)
  	at org.apache.celeborn.service.deploy.worker.Worker$.main(Worker.scala:734)
  	at org.apache.celeborn.service.deploy.worker.Worker.main(Worker.scala)
```

The released `leveldbjni-all` for `org.fusesource.leveldbjni` does not support AArch64 Linux, so we need to use `org.openlabtesting.leveldbjni`.

See https://issues.apache.org/jira/browse/HADOOP-16614

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
local test

Closes #1913 from cxzl25/CELEBORN-977.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-09-19 09:20:33 +08:00
sychen
045c682c89 [CELEBORN-978] Improve dependency.sh replacement mode
### What changes were proposed in this pull request?

### Why are the changes needed?
When the update script is executed locally, the build may emit a log line like the following, which causes awk to exit with an error.
```
Downloading from nexus: httpxxxx
```

```bash
./dev/dependencies.sh --replace
```

```
awk: trying to access out of range field -1
 input record number 1, file
 source line number 2
```
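
The failure mode can be reproduced in miniature. The commit message does not show the awk program, so the following Python sketch rests on an assumption: each classpath entry is split on `/` and the artifact id, version, and jar name are read from the last three fields. A stray log line has too few fields, which is what an out-of-range field index would look like. The function names are hypothetical.

```python
def parse_classpath_entry(path):
    """Return (artifact_id, version, jar_name), or None if the line is not a jar path."""
    parts = path.strip().split("/")
    # A repository path ends in .../<artifact_id>/<version>/<jar_name>. Anything
    # shorter, such as a stray download log line, cannot be indexed safely.
    if len(parts) < 3 or not parts[-1].endswith(".jar"):
        return None
    return parts[-3], parts[-2], parts[-1]


def parse_classpath(lines):
    """Parse every line and silently drop the ones that are not jar paths."""
    entries = (parse_classpath_entry(line) for line in lines)
    return [entry for entry in entries if entry is not None]


sample = [
    "Downloading from nexus",
    "/repo/com/google/guava/guava/32.1.3-jre/guava-32.1.3-jre.jar",
]
parsed = parse_classpath(sample)
```

Filtering on the shape of the line, rather than on a known log prefix, keeps the parser tolerant of other unexpected output from the build tool.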

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1914 from cxzl25/CELEBORN-978.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-16 09:35:13 +08:00
mingji
e0c00ecd38 [CELEBORN-839][MR] Support Hadoop MapReduce
### What changes were proposed in this pull request?
1. Map side merge and push.
2. Support hadoop2 & 3.
3. Reduce in-memory merge.
4. Integrate LifecycleManager to RmApplicationMaster.

### Why are the changes needed?
Ditto.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
Cluster.

I tested this PR on a cluster of 4 nodes, each with 16 CPUs, 64 GB of memory, and 4 ESSDs.
Hadoop 2.8.5

1TB Terasort, 8400 mappers, 1000 reducers
Celeborn 81min vs MR shuffle 89min
![mr1](https://github.com/apache/incubator-celeborn/assets/4150993/a3cf6493-b6ff-4c03-9936-4558cf22761d)
![mr2](https://github.com/apache/incubator-celeborn/assets/4150993/9119ffb4-6996-4b77-bcdf-cbd6db5c096f)

1GB wordcount, 8 mappers, 8 reducers
Celeborn 35s VS MR shuffle 38s
![mr3](https://github.com/apache/incubator-celeborn/assets/4150993/907dce24-16b7-4788-ab5d-5b784fd07d47)
![mr4](https://github.com/apache/incubator-celeborn/assets/4150993/8e8065b9-6c46-4c8d-9e71-45eed8e63877)

Closes #1830 from FMX/CELEBORN-839.

Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-14 14:12:53 +08:00
Fu Chen
142d12caa5 [CELEBORN-929][INFRA] Add dependencies check CI
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

As title

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

Closes #1852 from cfmcgrady/audit-deps-ci.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
2023-09-07 14:02:07 +08:00
Kent Yao
28449630f3 [CELEBORN-937][INFRA] Improve branch suggestion for backporting
### What changes were proposed in this pull request?

This PR automatically iterates to the next branch to be merged instead of always suggesting the latest one.
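
The behavior can be sketched as follows. This is an illustrative Python fragment, not the code in the merge script: the function name and the branch names are hypothetical, and the branch list is assumed to be ordered newest first.

```python
def next_branch_suggestion(release_branches, already_merged):
    """Return the first release branch not yet merged into, or None when all are done."""
    for branch in release_branches:
        if branch not in already_merged:
            return branch
    return None


branches = ["branch-0.3", "branch-0.2"]  # hypothetical names, newest first
first = next_branch_suggestion(branches, already_merged=[])
second = next_branch_suggestion(branches, already_merged=["branch-0.3"])
done = next_branch_suggestion(branches, already_merged=branches)
```

After each backport the default moves on to the next branch, so accepting the prompt repeatedly walks through the release branches once instead of offering the same one again.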

### Why are the changes needed?

Prevents merging into the wrong branch by mistake.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

manually

Closes #1870 from yaooqinn/CELEBORN-937.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-01 00:20:42 +08:00
Kent Yao
ba4f1bb2fe
[CELEBORN-931][INFRA] Fix merged pull requests resolution
### What changes were proposed in this pull request?

This PR fixes the resolution for merged pull requests. It appears that the user "asfgit" is no longer closing pull requests, but rather the committers are.

### Why are the changes needed?

Bugfix. Makes the merge script re-runnable if you accidentally abort a cherry-pick or change your mind later about backporting.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

tested locally

Closes #1862 from yaooqinn/CELEBORN-931.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-08-30 09:51:34 +08:00
Kent Yao
7e373feea7
[CELEBORN-930][INFRA][FOLLOWUP] Fix environment variable naming
### What changes were proposed in this pull request?

Replace JIRA_USERNAME and JIRA_PASSWORD with ASF_*

### Why are the changes needed?

hotfix

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

manually

Closes #1861 from yaooqinn/CELEBORN-930_F.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-08-29 23:33:04 +08:00
Kent Yao
df8b56a7c7 [CELEBORN-930][INFRA] Eagerly check if the token is valid to align with the behavior of username/password auth
### What changes were proposed in this pull request?

Previously, we allowed token authentication when resolving Jira issues during pull request merging. However, token auth is lazy during the initial handshake, so a bad token is not reported up front, which may confuse maintainers.

This pull request promptly calls the current_user() function to initiate authentication and provides clear instructions for token expiration.

see also 8523ee5d90

### Why are the changes needed?

make it easy for maintainers to update their expired Jira tokens.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

a maintainer can test this with invalid tokens

Closes #1857 from yaooqinn/CELEBORN-930.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-29 21:33:11 +08:00
Kent Yao
2b657c5243 [CELEBORN-918][INFRA] Auto Assign First-time contributor with Contributors role
### What changes were proposed in this pull request?

As an incubating project, we routinely welcome first-time contributors. This PR automates granting them the Contributors role so that issues can be assigned to them.

### Why are the changes needed?

GitHub - JIRA integration

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

tested at apache/spark project, and

```python
>>> asf_jira.project_roles("CELEBORN")
{'Developers': {'id': '10050', 'url': 'https://issues.apache.org/jira/rest/api/2/project/12324920/role/10050'}, 'Contributors': {'id': '10010', 'url': 'https://issues.apache.org/jira/rest/api/2/project/12324920/role/10010'}, 'PMC': {'id': '10011', 'url': 'https://issues.apache.org/jira/rest/api/2/project/12324920/role/10011'}, 'Committers': {'id': '10001', 'url': 'https://issues.apache.org/jira/rest/api/2/project/12324920/role/10001'}, 'Administrators': {'id': '10002', 'url': 'https://issues.apache.org/jira/rest/api/2/project/12324920/role/10002'}, 'ASF Members': {'id': '10150', 'url': 'https://issues.apache.org/jira/rest/api/2/project/12324920/role/10150'}, 'Users': {'id': '10040', 'url': 'https://issues.apache.org/jira/rest/api/2/project/12324920/role/10040'}, 'Contributors 1': {'id': '10350', 'url': 'https://issues.apache.org/jira/rest/api/2/project/12324920/role/10350'}}

```

Closes #1839 from yaooqinn/CELEBORN-918.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-26 16:50:31 +08:00
Fu Chen
49b6b10d5e [CELEBORN-879] Add dev/dependencies.sh for audit dependencies
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

As title

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

Closes #1797 from cfmcgrady/audit-deps.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-26 15:59:20 +08:00
Kent Yao
77abb31a5b
[CELEBORN-910][INFRA] Support JIRA_ACCESS_TOKEN for merging script
### What changes were proposed in this pull request?

This PR supports JIRA_ACCESS_TOKEN for merge script to enable token auth

c36d54a569

### Why are the changes needed?

Tokens are more secure and can easily be revoked or set to expire.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

Your Jira admins can create a token for verification.

Closes #1837 from yaooqinn/CELEBORN-910.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-08-24 20:02:44 +08:00
Kent Yao
1550f92086 [CELEBORN-907][INFRA] The Jira Python misses our assignee when it searches users again
…

### What changes were proposed in this pull request?

A detailed description can be found in 8fb799d47b.

### Why are the changes needed?

bypass upstream bug

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

I guess pan3793 has already hit the issue on the Jira side when resolving CELEBORN-903.

Closes #1832 from yaooqinn/CELEBORN-907.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-24 11:54:52 +08:00
Kent Yao
ad890e9381
[CELEBORN-903][INFRA] Fix list index out of range for JIRA resolution in merge_pr
### What changes were proposed in this pull request?

This PR fixes list index out-of-range error for the merge_pr script

The error occurs when the branch we merge into does not have a jira project version.

see also cb16591f9b
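
The failure mode described above can be sketched in Python. This is an illustration under assumptions, not the script's code: the function name, the `branch-` prefix convention, and the version strings are hypothetical.

```python
def default_fix_version(branch, unreleased_versions):
    """Pick a default Jira fix version for `branch`, or None if there is no match.

    `unreleased_versions` is assumed to be a list of version names, newest first.
    """
    prefix = branch.replace("branch-", "") + "."
    matching = [v for v in unreleased_versions if v.startswith(prefix)]
    # Indexing an empty list (matching[-1]) raises IndexError, which is the
    # "list index out of range" case: the branch has no Jira project version.
    if not matching:
        return None
    # The last match is the oldest unreleased version, i.e. the next to ship.
    return matching[-1]


versions = ["0.4.0", "0.3.1", "0.3.0"]  # hypothetical version names
found = default_fix_version("branch-0.3", versions)
missing = default_fix_version("branch-0.2", versions)
```

Returning `None` lets the caller fall back to prompting the maintainer for a fix version instead of aborting the merge.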

### Why are the changes needed?

Bugfix

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

Verification is to be done by a maintainer: check out this PR and use the updated script to merge and test.

Closes #1827 from yaooqinn/CELEBORN-903.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-08-23 18:49:55 +08:00
Cheng Pan
007b716b64
[CELEBORN-633][INFRA] Introduce PR merge script
### What changes were proposed in this pull request?

Introduce PR merge script `dev/merge_pr.py`, which is borrowed from Apache Spark

### Why are the changes needed?

This script simplifies the PR merge procedure

- auto backport to release branches
- auto close the JIRA ticket
- auto fill in the JIRA fixed version
- preserve the PR description in git log
- preserve the author and committer in git log

### Does this PR introduce _any_ user-facing change?

No, it's for committers.

### How was this patch tested?

a1de16a80f was merged by this tool

Closes #1539 from pan3793/CELEBORN-633.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-02 19:52:04 +08:00
Ethan Feng
114b1b4d62
[CELEBORN-548][FLINK] Support flink 1.17. (#1472) 2023-05-05 23:00:49 +08:00
Ethan Feng
93d2f106e0
[CELEBORN-548][FLINK] Support flink 1.15. (#1463) 2023-05-04 15:23:59 +08:00
Cheng Pan
a16ba0e807
[CELEBORN-180][BUILD] Script for creating binary release artifact (#1129) 2023-01-03 12:58:42 +08:00
Cheng Pan
7105f98829
[CELEBORN-160][BUILD] Spilt CI workflow (#1107) 2022-12-21 23:47:01 +08:00
Cheng Pan
dc66369973
[CELEBORN-150][BUILD] Reduce binary tarball size by sharing jars (#1095)
* [CELEBORN-150][BUILD] Reduce binary tarball size by sharing jars

* nit

* nit

* docker

* nit

* cp -R
2022-12-16 14:30:17 +08:00
Shuang
f3f104870c
[CELEBORN-75] Initialize flink plugin module (#1027) 2022-12-07 15:53:00 +08:00
Cheng Pan
df7cb8550b
[INFRA] Inroduce checkout_pr.sh shell script (#968) 2022-11-14 22:28:43 +08:00
Binjie Yang
f51fae6c75
[REFACTOR] Replace the missing Remote Shuffle Service (#885) 2022-10-28 17:37:59 +08:00
Cheng Pan
65614edfbb
[BUILD] Create shaded module for Spark client (#878) 2022-10-27 22:11:54 +08:00
Cheng Pan
873eeeb1ed
[BUILD] Add apache- prefix in release tarball name (#854) 2022-10-25 22:39:48 +08:00
Cheng Pan
ab16b4f101
[INFRA] Rename modules w/ celeborn prefix (#723) 2022-10-08 08:05:57 +08:00
Cheng Pan
29210fe9b7
[BUILD] Build in serial mode (#545) 2022-09-05 20:05:33 +08:00
Cheng Pan
82566148d8
Use different artifact name for shuffle manager 2/3 (#541) 2022-09-05 19:47:24 +08:00
Cheng Pan
c88ce306be
Use Spotless to auto check and reformat Java/Scala code (#497) 2022-09-01 21:19:56 +08:00
Cheng Pan
3dddb65f31
Enable Apache Rat and fix license header (#492) 2022-08-31 23:53:33 +08:00
Cheng Pan
fc96034742
[BUILD] Flatten Jars for Master and Worker (#469) 2022-08-26 21:33:38 +08:00
Ethan Feng
a4bab91453
[issue-332] support flush disk buffer to hdfs (#430) 2022-08-23 21:04:45 +08:00
Cheng Pan
9b6ec58e2a
Add profile for Spark 3.2/3.3 (#380) 2022-08-17 22:27:43 +08:00
Cheng Pan
bb0c9b21fc
[ISSUE-350] Rewrite RssShuffleManager using Java to pass compile on Spark 3.1+ (#370) 2022-08-17 15:59:50 +08:00
Cheng Pan
f1f4b894af
Build: Enhance build system (#349) 2022-08-15 14:59:01 +08:00
Ethan Feng
f3bcb7f6a8
[ISSUE-146]update slots distribution mechanism (#273) 2022-08-12 23:38:19 +08:00
Binjie Yang
8ececd60a6
fix (#314) 2022-08-10 16:37:43 +08:00
Ethan Feng
9ad8254b0a
AQE support. (#67) 2022-04-01 20:19:01 +08:00
zky.zhoukeyong
ba5920acde Initial Commit for RSS 2021-12-28 20:57:35 +08:00