Commit Graph

361 Commits

Author SHA1 Message Date
SteNicholas
450dac8245
[CELEBORN-1447] Support configuring thread number of worker to wait for commit shuffle data files to finish
### What changes were proposed in this pull request?

Introduce `celeborn.worker.commitFiles.wait.threads` to support configuring thread number of worker to wait for commit shuffle data files to finish.

### Why are the changes needed?

`celeborn.worker.commitFiles.threads` supports the configuration that is the thread number of worker to commit shuffle data files asynchronously including waiting for commit files to finish at present. It should support to configure thread number of waiting for commit shuffle data files to finish which avoids the situation where the commit thread pool is waiting for commit files and no thread could commit shuffle data files.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GA.

Closes #2539 from SteNicholas/CELEBORN-1447.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-06-03 17:47:01 +08:00
SteNicholas
2a57fab869 [CELEBORN-1400] Bump Ratis version from 2.5.1 to 3.0.1
### What changes were proposed in this pull request?

Bump Ratis version from 2.5.1 to 3.0.1. Address incompatible changes:

- RATIS-589. Eliminate buffer copying in SegmentedRaftLogOutputStream.(https://github.com/apache/ratis/pull/964)
- RATIS-1677. Do not auto format RaftStorage in RECOVER.(https://github.com/apache/ratis/pull/718)
- RATIS-1710. Refactor metrics api and implementation to separated modules. (https://github.com/apache/ratis/pull/749)

### Why are the changes needed?

Bump Ratis version from 2.5.1 to 3.0.1. Ratis has released v3.0.0, v3.0.1, which release note refers to [3.0.0](https://ratis.apache.org/post/3.0.0.html), [3.0.1](https://ratis.apache.org/post/3.0.1.html). The 3.0.x version include new features like pluggable metrics and lease read, etc, some improvements and bugfixes including:

- 3.0.0: Change list of ratis 3.0.0 In total, there are roughly 100 commits diffing from 2.5.1 including:
   - Incompatible Changes
      - RaftStorage Auto-Format
      - RATIS-1677. Do not auto format RaftStorage in RECOVER. (https://github.com/apache/ratis/pull/718)
      - RATIS-1694. Fix the compatibility issue of RATIS-1677. (https://github.com/apache/ratis/pull/731)
      - RATIS-1871. Auto format RaftStorage when there is only one directory configured. (https://github.com/apache/ratis/pull/903)
      - Pluggable Ratis-Metrics (RATIS-1688)
      - RATIS-1689. Remove the use of the thirdparty Gauge. (https://github.com/apache/ratis/pull/728)
      - RATIS-1692. Remove the use of the thirdparty Counter. (https://github.com/apache/ratis/pull/732)
      - RATIS-1693. Remove the use of the thirdparty Timer. (https://github.com/apache/ratis/pull/734)
      - RATIS-1703. Move MetricsReporting and JvmMetrics to impl. (https://github.com/apache/ratis/pull/741)
      - RATIS-1704. Fix SuppressWarnings(“VisibilityModifier”) in RatisMetrics. (https://github.com/apache/ratis/pull/742)
      - RATIS-1710. Refactor metrics api and implementation to separated modules. (https://github.com/apache/ratis/pull/749)
      - RATIS-1712. Add a dropwizard 3 implementation of ratis-metrics-api. (https://github.com/apache/ratis/pull/751)
      - RATIS-1391. Update library dropwizard.metrics version to 4.x (https://github.com/apache/ratis/pull/632)
      - RATIS-1601. Use the shaded dropwizard metrics and remove the dependency (https://github.com/apache/ratis/pull/671)
      - Streaming Protocol Change
      - RATIS-1569. Move the asyncRpcApi.sendForward(..) call to the client side. (https://github.com/apache/ratis/pull/635)
   - New Features
      - Leader Lease (RATIS-1864)
      - RATIS-1865. Add leader lease bound ratio configuration (https://github.com/apache/ratis/pull/897)
      - RATIS-1866. Maintain leader lease after AppendEntries (https://github.com/apache/ratis/pull/898)
      - RATIS-1894. Implement ReadOnly based on leader lease (https://github.com/apache/ratis/pull/925)
      - RATIS-1882. Support read-after-write consistency (https://github.com/apache/ratis/pull/913)
      - StateMachine API
      - RATIS-1874. Add notifyLeaderReady function in IStateMachine (https://github.com/apache/ratis/pull/906)
      - RATIS-1897. Make TransactionContext available in DataApi.write(..). (https://github.com/apache/ratis/pull/930)
      - New Configuration Properties
      - RATIS-1862. Add the parameter whether to take Snapshot when stopping to adapt to different services (https://github.com/apache/ratis/pull/896)
      - RATIS-1930. Add a conf for enable/disable majority-add. (https://github.com/apache/ratis/pull/961)
      - RATIS-1918. Introduces parameters that separately control the shutdown of RaftServerProxy by JVMPauseMonitor. (https://github.com/apache/ratis/pull/950)
      - RATIS-1636. Support re-config ratis properties (https://github.com/apache/ratis/pull/800)
      - RATIS-1860. Add ratis-shell cmd to generate a new raft-meta.conf. (https://github.com/apache/ratis/pull/901)
   - Improvements & Bug Fixes
      - Netty
         - RATIS-1898. Netty should use EpollEventLoopGroup by default (https://github.com/apache/ratis/pull/931)
         - RATIS-1899. Use EpollEventLoopGroup for Netty Proxies (https://github.com/apache/ratis/pull/932)
         - RATIS-1921. Shared worker group in WorkerGroupGetter should be closed. (https://github.com/apache/ratis/pull/955)
         - RATIS-1923. Netty: atomic operations require side-effect-free functions. (https://github.com/apache/ratis/pull/956)
      - RaftServer
         - RATIS-1924. Increase the default of raft.server.log.segment.size.max. (https://github.com/apache/ratis/pull/957)
         - RATIS-1892. Unify the lifetime of the RaftServerProxy thread pool (https://github.com/apache/ratis/pull/923)
         - RATIS-1889. NoSuchMethodError: RaftServerMetricsImpl.addNumPendingRequestsGauge https://github.com/apache/ratis/pull/922 (https://github.com/apache/ratis/pull/922)
         - RATIS-761. Handle writeStateMachineData failure in leader. (https://github.com/apache/ratis/pull/927)
         - RATIS-1902. The snapshot index is set incorrectly in InstallSnapshotReplyProto. (https://github.com/apache/ratis/pull/933)
         - RATIS-1912. Fix infinity election when perform membership change. (https://github.com/apache/ratis/pull/954)
         - RATIS-1858. Follower keeps logging first election timeout. (https://github.com/apache/ratis/pull/894)

- 3.0.1:This is a bugfix release. See the [changes between 3.0.0 and 3.0.1](https://github.com/apache/ratis/compare/ratis-3.0.0...ratis-3.0.1) releases.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Cluster manual test.

Closes #2480 from SteNicholas/CELEBORN-1400.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-05-30 17:22:22 +08:00
mingji
fd490013ae
Revert "[CELEBORN-1388] Use finer grained locks in changePartitionManager"
This reverts commit 9f304798cb.
2024-05-30 11:18:58 +08:00
SteNicholas
dd87419044
[CELEBORN-1380][FOLLOWUP] leveldbjni uses org.openlabtesting.leveldbjni to support linux aarch64 platform for leveldb via aarch64 profile
### What changes were proposed in this pull request?

Dependency leveldbjni uses `org.openlabtesting.leveldbjni` to support linux aarch64 platform for leveldb via `aarch64` profile.

Follow up #2476.

### Why are the changes needed?

Celeborn worker could not start on arm arch devices if db backend is `LevelDB`, which should support leveldbjni on the aarch64 platform.

aarch64 uses `org.openlabtesting.leveldbjni:leveldbjni-all.1.8`, and other platforms use `org.fusesource.leveldbjni:leveldbjni-all.1.8`. Meanwhile, because some hadoop dependencies packages are also depend on `org.fusesource.leveldbjni:leveldbjni-all`, but hadoop merge the similar change on trunk, details see
[HADOOP-16614](https://issues.apache.org/jira/browse/HADOOP-16614), therefore it should exclude the dependency of `org.fusesource.leveldbjni` for these hadoop packages related.

In addtion, `org.openlabtesting.leveldbjni` requires glibc version 3.4.21. Otherwise, there will be the following potential runtime risks:

```
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGBUS (0x7) at pc=0x00007fad3630b12a, pid=62, tid=0x00007f93394ef700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_162-b12) (build 1.8.0_162-b12)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.162-b12 mixed mode linux-amd64 )
# Problematic frame:
# C  [libc.so.6+0x8412a]
#
# Core dump written. Default location: /data/service/celeborn/core or core.62
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

---------------  T H R E A D  ---------------

Current thread (0x00007f9308001000):  JavaThread "leveldb" [_thread_in_native, id=878, stack(0x00007f9338cf0000,0x00007f93394f0000)]

siginfo: si_signo: 7 (SIGBUS), si_code: 2 (BUS_ADRERR), si_addr: 0x00007f97380d2220
```

Backport:

- https://github.com/apache/spark/pull/26636
- https://github.com/apache/spark/pull/31036

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2530 from SteNicholas/CELEBORN-1380.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-05-27 14:07:02 +08:00
Sanskar Modi
f527b22b4d [CELEBORN-1410] Combine multiple ShuffleBlockInfo into a single ShuffleBlockInfo
### What changes were proposed in this pull request?

Merging smaller `ShuffleBlockInfo` corresponding into same mapID, such that size of each block does not exceeds `celeborn.shuffle.chunk.size`

### Why are the changes needed?
As sorted ShuffleBlocks are contiguous, we can compact multiple `ShuffleBlockInfo` into one as long as the size of compacted one does not exceeds half of `celeborn.shuffle.chunk.size`. This way we can decrease the number of ShuffleBlock objects.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UTs

Closes #2524 from s0nskar/CELEBORN-1410.

Lead-authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Co-authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2024-05-23 21:22:45 +08:00
mingji
89d56c9bbc
[CELEBORN-914] Support memory file storage
### What changes were proposed in this pull request?
To support memory file storage.

### Why are the changes needed?
To improve shuffle performance for small shuffle files.

Design doc: https://docs.google.com/document/d/1SM-oOM0JHEIoRHTYhE9PYH60_1D3NMxDR50LZIM7uW0/edit?usp=sharing

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA and manually test on a cluster.

Closes #2300 from FMX/B914.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-05-23 21:05:52 +08:00
Shuang
308eed28c9 [CELEBORN-1427] Add Capacity metrics for Celeborn
### What changes were proposed in this pull request?
As title

### Why are the changes needed?
The Celeborn cluster does not currently provide metrics for 'TotalCapacity' and 'TotalFreeCapacity

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA

Closes #2521 from RexXiong/CELEBORN-1427.

Authored-by: Shuang <lvshuang.xjs@alibaba-inc.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-05-23 16:06:11 +08:00
SteNicholas
cd5609971f
[CELEBORN-1434] Support MRAppMasterWithCeleborn to disable job recovery and job reduce slow start by default
### What changes were proposed in this pull request?

`MRAppMasterWithCeleborn` disables `yarn.app.mapreduce.am.job.recovery.enable` and sets `mapreduce.job.reduce.slowstart.completedmaps` to 1 by default.

### Why are the changes needed?

MapReduce does not set the flag which indicates whether to keep containers across application attempts in ApplicationSubmissionContext. Meanwhile, make sure reduces are scheduled only after all map are completed. Therefore, `MRAppMasterWithCeleborn` could disable `yarn.app.mapreduce.am.job.recovery.enable` and set `mapreduce.job.reduce.slowstart.completedmaps` to 1 by default.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`WordCountTest`

Closes #2525 from SteNicholas/CELEBORN-1434.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-05-22 15:32:41 +08:00
Mridul Muralidharan
a13d167617 [CELEBORN-1401] Add SSL support for ratis communication
### What changes were proposed in this pull request?

When SSL is enabled for master, secure the Ratis communication as well with TLS

### Why are the changes needed?

Currently, when TLS is enabled for RPC, Ratis comms still goes in the clear - add support for TLS.
Note that currently this only supports GRPC, and not netty.

### Does this PR introduce _any_ user-facing change?
Secures ratis communication when TLS is enabled at master for rpc.

### How was this patch tested?
Local tests and additional unit tests added

Closes #2515 from mridulm/CELEBORN-1401-add-ratis-ssl-support.

Authored-by: Mridul Muralidharan <mridulatgmail.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-05-17 17:08:11 +08:00
SteNicholas
9908035ba8 [CELEBORN-1402] SparkShuffleManager print warning log for spark.executor.userClassPathFirst=true with ShuffleManager defined in user jar
### What changes were proposed in this pull request?

`SparkShuffleManager` print warning log for `spark.executor.userClassPathFirst=true` with `ShuffleManager` defined in user jar via `--jar` or `spark.jars`.

### Why are the changes needed?

When `spark.executor.userClassPathFirst` is enabled with ShuffleManager defined in user jar, the `ClassLoader` of `handle` is `ChildFirstURLClassLoader`, which is different from `CelebornShuffleHandle` of which the `ClassLoader` is `AppClassLoader` in `SparkShuffleManager#getWriter/getReader`. The local test log is as follows:

```
./bin/spark-sql --master yarn --deploy-mode client \
--conf spark.celeborn.master.endpoints=localhost:9099 \
--conf spark.executor.userClassPathFirst=true \
--conf spark.jars=/tmp/celeborn-client-spark-3-shaded_2.12-0.5.0-SNAPSHOT.jar \
--conf spark.shuffle.manager=org.apache.spark.shuffle.celeborn.SparkShuffleManager \
--conf spark.shuffle.service.enabled=false

./bin/spark-sql --master yarn --deploy-mode client --jars /tmp/celeborn-client-spark-3-shaded_2.12-0.5.0-SNAPSHOT.jar \
--conf spark.celeborn.master.endpoints=localhost:9099 \
--conf spark.executor.userClassPathFirst=true \
--conf spark.shuffle.manager=org.apache.spark.shuffle.celeborn.SparkShuffleManager \
--conf spark.shuffle.service.enabled=false
```
```
24/04/28 18:03:31 [Executor task launch worker for task 0.0 in stage 5.0 (TID 8)] WARN SparkShuffleManager: [getWriter] handle classloader: org.apache.spark.util.ChildFirstURLClassLoader, CelebornShuffleHandle classloader: sun.misc.Launcher$AppClassLoader
```

It causes that `SparkShuffleManager` fallback to vanilla Spark `SortShuffleManager` for `spark.executor.userClassPathFirst=true` with `ShuffleManager` defined in user jar before https://github.com/apache/spark/pull/43627. After [SPARK-45762](https://issues.apache.org/jira/browse/SPARK-45762), the `ClassLoader` of `handle` and `CelebornShuffleHandle` are both `ChildFirstURLClassLoader`.

```
24/04/28 18:03:31 [Executor task launch worker for task 0.0 in stage 5.0 (TID 8)] WARN SparkShuffleManager: [getWriter] handle classloader: org.apache.spark.util.ChildFirstURLClassLoader, CelebornShuffleHandle classloader: org.apache.spark.util.ChildFirstURLClassLoader
```

Therefore, `SparkShuffleManager` should print warning log to remind for `spark.executor.userClassPathFirst=true` with `ShuffleManager` defined in user jar.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test.

Closes #2482 from SteNicholas/CELEBORN-1402.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-05-17 11:03:15 +08:00
Cheng Pan
e66d509a95
[CELEBORN-1369][FOLLOWUP] Improve docs for shuffle fallback policy
### What changes were proposed in this pull request?

Improve docs for shuffle fallback policy

Rename a configuration

```patch
- celeborn.client.spark.shuffle.forceFallback.numPartitionsThreshold
+ celeborn.client.spark.shuffle.fallback.numPartitionsThreshold
````

### Why are the changes needed?

Canonicalize the words to "spark built-in shuffle implementation" everywhere.

And `...forceFallback...` is confusing, use `...fallback...` with explicit docs instead.

### Does this PR introduce _any_ user-facing change?

Deprecate a configuration but still effective.

### How was this patch tested?

Pass CI.

Closes #2494 from pan3793/CELEBORN-1369-followup.

Lead-authored-by: Cheng Pan <chengpan@apache.org>
Co-authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2024-05-15 19:18:39 +08:00
Yi Chen
c20536e5c5
[CELEBORN-1425][HELM] Add helm chart unit tests to ensure manifests are rendered as expected
### What changes were proposed in this pull request?

Add helm chart unit tests.

### Why are the changes needed?

Unit tests can make resource manifests are rendered as expected with various configurations.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Detailed information about how to run helm chart unit tests can be found here [helm-unittest/helm-unittest](https://github.com/helm-unittest/helm-unittest). First, you need to install helm unit test plugin:

```shell
helm plugin install https://github.com/helm-unittest/helm-unittest.git
```

Then, run helm chart unitt tests as follows:

```shell
$ helm unittest charts/celeborn  --file "tests/**/*_test.yaml" --strict --debug
load_plugins.go:110: [info] file (/Users/chenyi/Library/helm/plugins/helm-acr/completion.yaml) not provided by plugin. No plugin auto-completion possible

### Chart [ celeborn ] charts/celeborn

 PASS  Test Celeborn configmap  charts/celeborn/tests/configmap_test.yaml
 PASS  Test Celeborn master pod monitor charts/celeborn/tests/master/podmonitor_test.yaml
 PASS  Test Celeborn master priority class      charts/celeborn/tests/master/priorityclass_test.yaml
 PASS  Test Celeborn master service     charts/celeborn/tests/master/service_test.yaml
 PASS  Test Celeborn master statefulset charts/celeborn/tests/master/statefulset_test.yaml
 PASS  Test Celeborn worker pod monitor charts/celeborn/tests/worker/podmonitor_test.yaml
 PASS  Test Celeborn worker priority class      charts/celeborn/tests/worker/priorityclass_test.yaml
 PASS  Test Celeborn worker service     charts/celeborn/tests/worker/service_test.yaml
 PASS  Test Celeborn worker statefulset charts/celeborn/tests/worker/statefulset_test.yaml

Charts:      1 passed, 1 total
Test Suites: 9 passed, 9 total
Tests:       48 passed, 48 total
Snapshot:    0 passed, 0 total
Time:        183.011375ms

```

Closes #2511 from ChenYi015/helm-unittest.

Authored-by: Yi Chen <github@chenyicn.net>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2024-05-15 19:17:30 +08:00
SteNicholas
db163bd793 [CELEBORN-1317][FOLLOWUP] Improve parameters, description and document of REST API
### What changes were proposed in this pull request?

Improve parameters, description and document of Celeborn REST API, including:

1. The POST request uses `FormParam` instead of `QueryParam`.
2. The parameter name uses lowercase instead of uppercase.
3. The description of `/exclude` aligns with document in `monitoring.md`.
4. The document of `REST API` adds the `Method` and `Parameters` to document GET/POST method and corresponding interface.

### Why are the changes needed?

The parameters, description and document of REST API need to improve after http server refine.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GA.

Closes #2495 from SteNicholas/CELEBORN-1317.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-05-09 17:41:13 +08:00
Shuang
9a9abfe3bc [CELEBORN-1245][FOLLOWUP] Fix SendWorkerEvent in HA mode
### What changes were proposed in this pull request?
As title

### Why are the changes needed?
Handle worker event use wrong request.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
`RatisMasterStatusSystemSuiteJ#testHandleWorkerEvent`

Closes #2493 from RexXiong/CELEBORN-1245-FOLLOW-UP.

Authored-by: Shuang <lvshuang.xjs@alibaba-inc.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-05-07 15:16:47 +08:00
SteNicholas
1cd231f5e0
[CELEBORN-1412] celeborn.client.rpc.*.askTimeout should fallback to celeborn.rpc.askTimeout
### What changes were proposed in this pull request?

`celeborn.client.rpc.*.askTimeout` should fallback to `celeborn.rpc.askTimeout`.

### Why are the changes needed?

The config option series `celeborn.client.rpc.*.askTimeout` should fallback to `celeborn.rpc.askTimeout` instead of `celeborn.<module>.io.connectionTimeout`, which including `celeborn.client.rpc.getReducerFileGroup.askTimeout`, `celeborn.client.rpc.registerShuffle.askTimeout` and `celeborn.client.rpc.requestPartition.askTimeout`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GA.

Closes #2492 from SteNicholas/CELEBORN-1412.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-05-07 13:47:22 +08:00
sychen
dc52192163 [CELEBORN-1409] CommitHandler commitFiles RPC supports separate timeout configuration
### What changes were proposed in this pull request?
This PR aims to supports separate timeout configuration at CommitHandler commitFiles RPC.

### Why are the changes needed?
The default value of `celeborn.worker.commitFiles.timeout` is 120s, and the default value of Client's RPC is 60s.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
GA

Closes #2488 from cxzl25/CELEBORN-1409.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-05-06 17:42:52 +08:00
xinyuwang1
7b1645ff6a [CELEBORN-1369] Support for disable fallback to Spark's default shuffle
### What changes were proposed in this pull request?
An option to disable fallback is provided.

### Why are the changes needed?
It's dangerous to fallback to external shuffle when applications run on both online and offline nodes because online services could be impacted due to a shortage of disk capacity.

### Does this PR introduce _any_ user-facing change?
Yes, fallback to Spark's default shuffle can be disabled by setting `celeborn.client.spark.shuffle.fallback.enabled=false`

### How was this patch tested?
manual test

Closes #2444 from littlexyw/fallback_disable.

Lead-authored-by: xinyuwang1 <xinyuwang1@xiaohongshu.com>
Co-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2024-05-03 14:32:28 +08:00
CodingCat
9f304798cb [CELEBORN-1388] Use finer grained locks in changePartitionManager
### What changes were proposed in this pull request?

this PR proposes to use finer grained lock in  changePartitionManager when handling requests for different partitions

### Why are the changes needed?

we observed the intensive competition of locks when there are many partition got split. most of  change-partition-executor threads are competing for the concurrenthashmap used in ChangePartitionManager...this concurrentHashMap is holding request per partition but we are lock at the whole map instead of per partition level,

with this change, the driver memory footprint is significantly reduced due to the increased processing throughput...

### Does this PR introduce _any_ user-facing change?

one more configs

### How was this patch tested?

prod

Closes #2462 from CodingCat/finer_grained_locks.

Authored-by: CodingCat <zhunansjtu@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2024-04-30 19:51:07 +08:00
Mridul Muralidharan
29b5586a60 [CELEBORN-1353] Document Celeborn security - authentication and SSL support
### What changes were proposed in this pull request?

User documentation for configuring SSL and authentication

### Why are the changes needed?

Document the new features.

### Does this PR introduce _any_ user-facing change?
Yes, updates documentation

### How was this patch tested?

Used `mkdocs serve` to render and validate the documentation.

Closes #2481 from mridulm/ssl-auth-documentation.

Lead-authored-by: Mridul Muralidharan <mridul@gmail.com>
Co-authored-by: Mridul Muralidharan <mridulatgmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2024-04-30 14:37:56 +08:00
Mridul Muralidharan
f27ede42c4
[CELEBORN-1356] Split rpc module into rpc_app and rpc_service
### What changes were proposed in this pull request?

Split the `rpc` transport module into `rpc_app` and `rpc_service` to allow for them to be independently configured.

### Why are the changes needed?

We need the ability to independently configure communication between application components (driver/executors in spark applications) and those to/from Celeborn service (master/workers) components.

This is particularly relevant for TLS support where applications might be running with TLS disabled for their rpc services or using self-signed certificates (see CELEBORN-1354 for an example), while services would have signed certs.

### Does this PR introduce _any_ user-facing change?

Yes, it allows users to independently configure rpc env within the application and those to/from services.
Backward compatibility is maintained - and so existing `rpc` is the fallback in case `rpc_app` or `rpc_service` config is not found.

### How was this patch tested?

Unit tests were enhanced, existing tests pass.

Closes #2460 from mridulm/split_rpc_module-retry1.

Lead-authored-by: Mridul Muralidharan <mridul@gmail.com>
Co-authored-by: Mridul Muralidharan <mridulatgmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2024-04-17 14:59:23 +08:00
CodingCat
121395f0f5 [CELEBORN-1314] add capacity-bounded inbox for rpc endpoint
### What changes were proposed in this pull request?

we found a lot of driver OOM issue when dealing with spark applications with super large shuffle,

with the heap dump, we found the inbox of rpc endpoints accumulated tons of change partition location message... even we have increased splitThreshold to 10G, many jobs still have this issue (keep increasing this value will increase the risk of disk overusage of workers)

This PR implements capacity-bounded inbox which is based on a LinkedBlockingQueue with a configured capacity, we found it effectively resolves the problem for us

### Why are the changes needed?

the following screenshots show the main memory consumer in Driver side

<img width="661" alt="image" src="https://github.com/apache/incubator-celeborn/assets/678008/d63196cc-6c3c-4b32-a9db-9871e7cb5fd8">
<img width="723" alt="image" src="https://github.com/apache/incubator-celeborn/assets/678008/64a506c4-03ea-4932-98ba-f8f4923daa6e">

### Does this PR introduce _any_ user-facing change?

no, but two more configurations

### How was this patch tested?

integration tests and unit tests

screenshot showing the application driver memory usage with the patch (blue line)

<img width="766" alt="image" src="https://github.com/apache/incubator-celeborn/assets/678008/86ecaba8-c164-4aef-ad83-cee03238e5da">

screenshot showing the application driver memory usage without patch (brown line)

<img width="799" alt="image" src="https://github.com/apache/incubator-celeborn/assets/678008/a012e0ba-0292-4d25-a7b9-252bdc3cb8cb">

Closes #2366 from CodingCat/memory_bounded_driver.

Authored-by: CodingCat <zhunansjtu@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2024-04-16 10:56:32 +08:00
SteNicholas
3ac769e4fa
[CELEBORN-1236][FOLLOWUP] Gauge is_terminating, is_terminated and is_shutdown should represent a single numerical value
### What changes were proposed in this pull request?

Gauge `is_terminating`, `is_terminated` and `is_shutdown` should represent a single numerical value instead of boolean value.

### Why are the changes needed?

A gauge is a metric that represents a single numerical value that can arbitrarily go up and down. The value type of `is_terminating`, `is_terminated` and `is_shutdown` should be numerical, otherwise `AbstractSource#addGauge` would warn the failed log as follows:

```
2024-04-12 20:04:12,438 [WARN] [main] - org.apache.celeborn.common.metrics.source.ThreadPoolSource -Logging.scala(55) -Add gauge is_terminating failed, the value type class java.lang.Boolean is not a number
2024-04-12 20:04:12,438 [WARN] [main] - org.apache.celeborn.common.metrics.source.ThreadPoolSource -Logging.scala(55) -Add gauge is_terminated failed, the value type class java.lang.Boolean is not a number
2024-04-12 20:04:12,438 [WARN] [main] - org.apache.celeborn.common.metrics.source.ThreadPoolSource -Logging.scala(55) -Add gauge is_shutdown failed, the value type class java.lang.Boolean is not a number
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test.

Closes #2457 from SteNicholas/CELEBORN-1236.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-04-15 11:34:34 +08:00
SteNicholas
1d3558bd14 [CELEBORN-1385] HttpServer support idle timeout configuration of Jetty
### What changes were proposed in this pull request?

Introduce `celeborn.master.http.idleTimeout` and `celeborn.worker.http.idleTimeout` to support idle timeout configuration of Jetty for `HttpServer`.

### Why are the changes needed?

`ServerConnector` supports HTTP idle timeout configuration via `jetty.http.idleTimeout`, of which default value is 30000ms that is configured as `jetty.http.idleTimeout=300000`. `HttpServer` should also support idle timeout configuration of Jetty, which timeout is as follows:

```
2024-04-12 16:04:00,926 [DEBUG] [master-JettyScheduler-1] - org.eclipse.jetty.io.IdleTimeout -IdleTimeout.java(161) -SocketChannelEndPoint567d3f82{l=/127.0.0.1:9097,r=/127.0.0.1:35276,OPEN,fill=FI,flush=-,to=29999/30000}{io=1/1,kio=1,kro=1}->HttpConnection2f88da0c[p=HttpParser{s=START,0 of -1},g=HttpGenerator796c3666{s=START}]=>HttpChannelOverHttp63815646{s=HttpChannelState5c192497{s=IDLE rs=BLOCKING os=OPEN is=IDLE awp=false se=false i=true al=0},r=5,c=false/false,a=IDLE,uri=null,age=0} idle timeout check, elapsed: 29999 ms, remaining: 1 ms
2024-04-12 16:04:00,927 [DEBUG] [master-JettyScheduler-1] - org.eclipse.jetty.io.IdleTimeout -IdleTimeout.java(161) -SocketChannelEndPoint567d3f82{l=/127.0.0.1:9097,r=/127.0.0.1:35276,OPEN,fill=FI,flush=-,to=30001/30000}{io=1/1,kio=1,kro=1}->HttpConnection2f88da0c[p=HttpParser{s=START,0 of -1},g=HttpGenerator796c3666{s=START}]=>HttpChannelOverHttp63815646{s=HttpChannelState5c192497{s=IDLE rs=BLOCKING os=OPEN is=IDLE awp=false se=false i=true al=0},r=5,c=false/false,a=IDLE,uri=null,age=0} idle timeout check, elapsed: 30001 ms, remaining: -1 ms
2024-04-12 16:04:00,927 [DEBUG] [master-JettyScheduler-1] - org.eclipse.jetty.io.IdleTimeout -IdleTimeout.java(168) -SocketChannelEndPoint567d3f82{l=/127.0.0.1:9097,r=/127.0.0.1:35276,OPEN,fill=FI,flush=-,to=30001/30000}{io=1/1,kio=1,kro=1}->HttpConnection2f88da0c[p=HttpParser{s=START,0 of -1},g=HttpGenerator796c3666{s=START}]=>HttpChannelOverHttp63815646{s=HttpChannelState5c192497{s=IDLE rs=BLOCKING os=OPEN is=IDLE awp=false se=false i=true al=0},r=5,c=false/false,a=IDLE,uri=null,age=0} idle timeout expired
2024-04-12 16:04:00,927 [DEBUG] [master-JettyScheduler-1] - org.eclipse.jetty.io.FillInterest -FillInterest.java(136) -onFail FillInterest6cc48840{AC.ReadCB2f88da0c{HttpConnection2f88da0c::SocketChannelEndPoint567d3f82{l=/127.0.0.1:9097,r=/127.0.0.1:35276,OPEN,fill=FI,flush=-,to=30001/30000}{io=1/1,kio=1,kro=1}->HttpConnection2f88da0c[p=HttpParser{s=START,0 of -1},g=HttpGenerator796c3666{s=START}]=>HttpChannelOverHttp63815646{s=HttpChannelState5c192497{s=IDLE rs=BLOCKING os=OPEN is=IDLE awp=false se=false i=true al=0},r=5,c=false/false,a=IDLE,uri=null,age=0}}}
java.util.concurrent.TimeoutException: Idle timeout expired: 30001/30000 ms
    at org.eclipse.jetty.io.IdleTimeout.checkIdleTimeout(IdleTimeout.java:171) ~[jetty-io-9.4.52.v20230823.jar:9.4.52.v20230823]
    at org.eclipse.jetty.io.IdleTimeout.idleCheck(IdleTimeout.java:113) ~[jetty-io-9.4.52.v20230823.jar:9.4.52.v20230823]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_162]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_162]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) ~[?:1.8.0_162]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[?:1.8.0_162]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_162]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_162]
    at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_162]
2024-04-12 16:04:00,927 [DEBUG] [master-JettyScheduler-1] - org.eclipse.jetty.http.HttpParser -HttpParser.java(1883) -close HttpParser{s=START,0 of -1}
2024-04-12 16:04:00,927 [DEBUG] [master-JettyScheduler-1] - org.eclipse.jetty.http.HttpParser -HttpParser.java(1912) -START --> CLOSE
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2455 from SteNicholas/CELEBORN-1385.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-04-14 12:40:57 +08:00
Aravind Patnam
f04ebccd4d
[CELEBORN-1368] Log celeborn config for debugging purposes
### What changes were proposed in this pull request?
Log celeborn config for debugging purposes.

### Why are the changes needed?
Help with debugging

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
tested the patch internally.

Closes #2442 from akpatnam25/CELEBORN-1368.

Authored-by: Aravind Patnam <apatnam@linkedin.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-04-08 15:11:35 +08:00
CodingCat
c788c38025
[CELEBORN-1328] Introduce ActiveSlotsCount metric to monitor the number of active slots
### What changes were proposed in this pull request?

Introduce `ActiveSlots` metric to represent the disk resource demand currently in the cluster.

### Why are the changes needed?

It's recommended to introduce `ActiveSlots` metric to represent the disk resource demand currently in the cluster.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

In our test cluster (we can see the value of activeSlots increases and then back to 0 after the application finished, and slotsAllocated is increasing all the way).

![image](https://github.com/apache/incubator-celeborn/assets/678008/c05aa763-11ad-4bbd-9ae0-dd6a9cb01ac5)

Closes #2386 from CodingCat/slots_decrease.

Lead-authored-by: CodingCat <zhunansjtu@gmail.com>
Co-authored-by: Nan Zhu <CodingCat@users.noreply.github.com>
Co-authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-04-08 11:08:05 +08:00
SteNicholas
82022a9427
[CELEBORN-1362] Remove unnecessary configuration celeborn.client.flink.inputGate.minMemory and celeborn.client.flink.resultPartition.minMemory
### What changes were proposed in this pull request?

Remove unnecessary configuration `celeborn.client.flink.inputGate.minMemory` and `celeborn.client.flink.resultPartition.minMemory`.

### Why are the changes needed?

`celeborn.client.flink.inputGate.minMemory` and `celeborn.client.flink.resultPartition.minMemory` are configured as min memory reserved at present. Meanwhile, `celeborn.client.flink.inputGate.memory` should be at least `networkBufferSize * MIN_BUFFERS_PER_GATE` bytes, and `celeborn.client.flink.resultPartition.memory` should be at least `networkBufferSize * MIN_BUFFERS_PER_PARTITION` bytes. Therefore, `celeborn.client.flink.inputGate.minMemory` and `celeborn.client.flink.resultPartition.minMemory` are unnecessary configuration for `celeborn.client.flink.inputGate.memory` and `celeborn.client.flink.resultPartition.memory`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`PluginSideConfSuiteJ#testCoalesce`

Closes #2433 from SteNicholas/CELEBORN-1362.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-04-01 11:15:14 +08:00
SteNicholas
ff2bc92067 [CELEBORN-1317][FOLLOWUP] Update default value of celeborn.master.http.maxWorkerThreads and celeborn.worker.http.maxWorkerThreads via QueuedThreadPool
### What changes were proposed in this pull request?

Update default value of `celeborn.master.http.maxWorkerThreads` and `celeborn.worker.http.maxWorkerThreads` via `QueuedThreadPool`, of which default value is 200.

### Why are the changes needed?

`QueuedThreadPool` determines that the default minimum threads is 8, and the default maximum threads is 200 in [QueuedThreadPool#L121](48f6ab7289/jetty-core/jetty-util/src/main/java/org/eclipse/jetty/util/thread/QueuedThreadPool.java (L1210)) and [QueuedThreadPool#L125](48f6ab7289/jetty-core/jetty-util/src/main/java/org/eclipse/jetty/util/thread/QueuedThreadPool.java (L125)).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2428 from SteNicholas/CELEBORN-1317.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-03-29 11:56:04 +08:00
Fei Wang
adbc77cd4f [CELEBORN-1317] Refine celeborn http server and support swagger ui
### What changes were proposed in this pull request?

Before, there is no http request spec likes query param, http method and response mediaType.
And for each api, a HttpEndpoint class is needed.

In this PR, we refine the code for http service and provide swagger ui.

Note that: This pr does not change the orignal api request and response behavior, including metrics APIs.

TODO:
1. define DTO
2. http request authentication

<img width="1900" alt="image" src="https://github.com/apache/incubator-celeborn/assets/6757692/7f8c2363-170d-4bdf-b2c9-74260e31d3e5">

<img width="1138" alt="image" src="https://github.com/apache/incubator-celeborn/assets/6757692/3ae6ec8e-00a8-475b-bb37-0329536185f6">

### Why are the changes needed?

To close CELEBORN-1317

### Does this PR introduce _any_ user-facing change?

The api is align with before.

### How was this patch tested?
UT.

Closes #2371 from turboFei/jetty.

Authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-03-27 23:18:18 +08:00
Mridul Muralidharan
b14254be9a
[CELEBORN-1349] Add SSL related configs and support for ReloadingX509TrustManager
Add SSL related configs and support for `ReloadingX509TrustManager`, required for enabling SSL support.
Please see #2416 for the consolidated PR with all the changes for reference.

Introduces SSL related configs for enabling and configuring use of TLS.

Yes, introduces configs to control behavior of SSL

The overall PR #2411 (and this PR as well) passes all tests, this is specifically pulling out the `ReloadingX509TrustManager` and config related changes

Closes #2419 from mridulm/config-for-ssl.

Authored-by: Mridul Muralidharan <mridulatgmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2024-03-27 18:21:14 +08:00
SteNicholas
c9b878a2f5
[INFRA] Remove incubator/incubating for graduation
### What changes were proposed in this pull request?

Remove incubator/incubating for graduation including:

- Remove `incubator`/`Incubating`.
- Remove `DISCLAIMER` and corresponding link.
- Update Release scripts and template.

Fix #2415.

### Why are the changes needed?

The ASF board has approved a resolution to graduate Celeborn into a full Top Level Project. To transition from the Apache Incubator to a new TLP, there's a few action items we need to do to complete the transition.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2421 from SteNicholas/infra-graduation.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-03-27 13:54:47 +08:00
SteNicholas
e29f013e3a [CELEBORN-1357] AbstractRemoteShuffleResultPartitionFactory should remove the check of shuffle compression codec
### What changes were proposed in this pull request?

`AbstractRemoteShuffleResultPartitionFactory` removes the check of shuffle compression codec.

### Why are the changes needed?

`AbstractRemoteShuffleResultPartitionFactory` checks whether shuffle compression codec is LZ4 for Flink 1.14 and 1.15 version at present. Meanwhile, since Flink 1.17 version, ZSTD has already been supported. Therefore `AbstractRemoteShuffleResultPartitionFactory` should remove the check of shuffle compression codec for Flink 1.17 version and above, which is checked via the constructor of `BufferCompressor`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- `RemoteShuffleResultPartitionFactorySuiteJ`

Closes #2414 from SteNicholas/CELEBORN-1357.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-03-25 15:44:45 +08:00
lvshuang.xjs
9497d557e6
[CELEBORN-1345] Add a limit to the master's estimated partition size
### What changes were proposed in this pull request?
Currently, the Celeborn master calculates the estimatedPartitionSize based on the fileInfo committed by the application. This estimate is then used to allocate slots across all workers. However, this partition size may be too large or too small for Celeborn. For example, if an application commits a single file of 1TB to only one worker, using that partition size could result in all other workers having no available slots or only very small slots. To improve this, it would be better to implement a cap on the master's estimated partition size to prevent such imbalances.

### Why are the changes needed?
As title

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
UT

Closes #2412 from RexXiong/CELEBORN-1345.

Lead-authored-by: lvshuang.xjs <lvshuang.xjs@taobao.com>
Co-authored-by: Shuang <lvshuang.xjs@alibaba-inc.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-03-25 14:40:47 +08:00
SteNicholas
73cf1562f7 [CELEBORN-1299] Introduce JVM profiling in Celeborn Worker using async-profiler
### What changes were proposed in this pull request?

Introduce JVM profiling `JVMProfier` in Celeborn Worker using async-profiler to capture CPU and memory profiles.

### Why are the changes needed?

[async-profiler](https://github.com/async-profiler) is a sampling profiler for any JDK based on the HotSpot JVM that does not suffer from Safepoint bias problem. It has low overhead and doesn’t rely on JVMTI. It avoids the safepoint bias problem by using the `AsyncGetCallTrace` API provided by HotSpot JVM to profile the Java code paths, and Linux’s perf_events to profile the native code paths. It features HotSpot-specific APIs to collect stack traces and to track memory allocations.
The feature introduces a profier plugin that does not add any overhead unless enabled and can be configured to accept profiler arguments as a configuration parameter. It should support to turn profiling on/off, includes the jar/binaries needed for profiling.

Backport [[SPARK-46094] Support Executor JVM Profiling](https://github.com/apache/spark/pull/44021).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Worker cluster test.

Closes #2409 from SteNicholas/CELEBORN-1299.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-03-25 14:05:50 +08:00
CodingCat
c6c319d865 [CELEBORN-1309][FOLLOWUP] Cap the max memory can be used for sort buffer
### What changes were proposed in this pull request?

add a new parameter to cap the max memory can be used for sort writer buffer

### Why are the changes needed?

with a huge number of partitions, the threshold based on buffer size * number of partitions without this cap can be too large, e.g. 64K * 100000 = 6G

### Does this PR introduce _any_ user-facing change?

a new parameter

### How was this patch tested?

ut

Closes #2388 from CodingCat/adaptive_followup.

Lead-authored-by: CodingCat <zhunansjtu@gmail.com>
Co-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Co-authored-by: Keyong Zhou <zhouky@apache.org>
Co-authored-by: Keyong Zhou <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2024-03-25 12:08:04 +08:00
SteNicholas
8fbcbead48
[CELEBORN-1341][FOLLOWUP] Improve Celeborn document
### What changes were proposed in this pull request?

Improve Celeborn document to fix typos, formats, unvalid link and unsynced default value of document. Meanwhile, the public interfaces of `shuffleclient.md` keep the consistent with `ShuffleClient`.

### Why are the changes needed?

There are some typos, formats, unvalid link and unsynced default value fixes in Celeborn document at present. Meanwhile, the public interfaces of `shuffleclient.md` is inconsistent with `ShuffleClient`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2410 from SteNicholas/CELEBORN-1341.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-03-22 16:34:25 +08:00
SteNicholas
a371f934cf
[CELEBORN-1341] Improve Celeborn document
### What changes were proposed in this pull request?

Improve Celeborn document to fix typos, table formats and wrong description of document. Meanwhile, `deploy.md` adds the document of MapReduce client deployment.

### Why are the changes needed?

There are some typos and format fixes in Celeborn document at present. Meanwhile, the `deploy.md` does not contain the deployment of MapReduce client, which is inconsistent with `README.md` for Flink configuration.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2407 from SteNicholas/CELEBORN-1341.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-03-20 15:02:05 +08:00
SteNicholas
adaa96fc60 [CELEBORN-1310][FLINK] Support Flink 1.19
### What changes were proposed in this pull request?

Support Flink 1.19.

### Why are the changes needed?

Flink 1.19.0 is announced to release: [Announcing the Release of Apache Flink 1.19] (https://flink.apache.org/2024/03/18/announcing-the-release-of-apache-flink-1.19).

The main changes includes:

- `org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel` constructor change parameters:
   - `consumedSubpartitionIndex` changes to `consumedSubpartitionIndexSet`: [[FLINK-33743][runtime] Support consuming multiple subpartitions on a single channel](https://github.com/apache/flink/pull/23927).
   - adds `partitionRequestListenerTimeout`: [[FLINK-25055][network] Support listen and notify mechanism for partition request](https://github.com/apache/flink/pull/23565).
- `org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate` constructor removes parameters `subpartitionIndexRange`, `tieredStorageConsumerClient`, `nettyService` and `tieredStorageConsumerSpecs`: [[FLINK-33743][runtime] Support consuming multiple subpartitions on a single channel](https://github.com/apache/flink/pull/23927).
- Change the default config file to `config.yaml` in `flink-dist`: [[FLINK-33577][dist] Change the default config file to config.yaml in flink-dist](https://github.com/apache/flink/pull/24177).
- `org.apache.flink.configuration.RestartStrategyOptions` uses `org.apache.commons.compress.utils.Sets` of `commons-compress` dependency: [[FLINK-33865][runtime] Adding an ITCase to ensure exponential-delay.attempts-before-reset-backoff works well](https://github.com/apache/flink/pull/23942).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Local test:

- Flink batch job submission

```
$ ./bin/flink run examples/streaming/WordCount.jar --execution-mode BATCH
Executing example with default input data.
Use --input to specify file input.
Printing result to stdout. Use --output to specify output path.
Job has been submitted with JobID 2e9fb659991a9c29d376151783bdf6de
Program execution finished
Job with JobID 2e9fb659991a9c29d376151783bdf6de has finished.
Job Runtime: 1912 ms
```

- Flink batch job execution

![image](https://github.com/apache/incubator-celeborn/assets/10048174/18b60861-cafc-4df3-b94d-93307e728be2)

- Celeborn master log
```

24/03/18 20:52:47,513 INFO [celeborn-dispatcher-42] Master: Offer slots successfully for 1 reducers of 1710766312631-2e9fb659991a9c29d376151783bdf6de-0 on 1 workers.
```

- Celeborn worker log
```
24/03/18 20:52:47,704 INFO [celeborn-dispatcher-1] StorageManager: created file at /Users/nicholas/Software/Celeborn/apache-celeborn-0.5.0-SNAPSHOT/shuffle/celeborn-worker/shuffle_data/1710766312631-2e9fb659991a9c29d376151783bdf6de/0/0-0-0
24/03/18 20:52:47,707 INFO [celeborn-dispatcher-1] Controller: Reserved 1 primary location and 0 replica location for 1710766312631-2e9fb659991a9c29d376151783bdf6de-0
24/03/18 20:52:47,874 INFO [celeborn-dispatcher-2] Controller: Start commitFiles for 1710766312631-2e9fb659991a9c29d376151783bdf6de-0
24/03/18 20:52:47,890 INFO [worker-rpc-async-replier] Controller: CommitFiles for 1710766312631-2e9fb659991a9c29d376151783bdf6de-0 success with 1 committed primary partitions, 0 empty primary partitions, 0 failed primary partitions, 0 committed replica partitions, 0 empty replica partitions, 0 failed replica partitions.
```

Closes #2399 from SteNicholas/CELEBORN-1310.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-03-20 11:51:23 +08:00
sychen
91f6378682 [CELEBORN-1336] Remove client partition split pool
### What changes were proposed in this pull request?

### Why are the changes needed?
`CELEBORN-1320` uses `ReviveManager` to batch processing SOFT_SPLIT event RPC, so `partitionSplitPool` is no longer used, and the configuration item `celeborn.client.push.splitPartition.threads` is meaningless.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #2396 from cxzl25/CELEBORN-1336.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-03-18 21:48:59 +08:00
ForVic
15b0f16f74 [MINOR] Fix typo in developer docs - overview
### What changes were proposed in this pull request?
To fix a typo.

### Why are the changes needed?
To maintain the quality of Celeborn documentation.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
N/A

Closes #2397 from ForVic/forvic/fix_typo.

Authored-by: ForVic <victor.lakers0@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2024-03-17 16:15:16 +08:00
SteNicholas
d33ac28945 [MINOR] Fix typo of celeborn.network.bind.preferIpAddress doc
### What changes were proposed in this pull request?

Fix typo of `celeborn.network.bind.preferIpAddress` doc from `ture` to `true`.

### Why are the changes needed?

`celeborn.network.bind.preferIpAddress` doc has typo for `ture`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2392 from SteNicholas/prefer-ip-address.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-03-14 17:12:33 +08:00
CodingCat
2e251457f3 [CELEBORN-1309] Support adaptive management of memory threshold for SortBasedWriter
### What changes were proposed in this pull request?

while SortBasedWriter has less memory footprint than HashBasedWriter, it suffers from performance issue when we have  many partitions and the write buffer is filled with small chunks of data quickly

 for example, if sort buffer size is 32K, you have 4 partitions and 128K data in total, the data distribution is like partition A, B, C, D, each time it comes with 8K per partition.... in this case, you need to compress and send small 8K chunk 4 times per partition , the cost would become very high. If you use hashbasedwriter, it doesn't have this problem since the push only happens when the per-partition buffer is full. Of course , larger sort buffer size can mitigate the issue, but tuning sort buffer size per job is a tedious work

this PR introduces a new feature that we measure total size of pushed bytes and pushed count as well as the "should-pushed" bytes and counts (should-push means that , the data we pushed is larger than CLIENT_PUSH_BUFFER_MAX_SIZE (in another word, we will trigger a push even with hashbasedwriter in this case))

when actualPushedBytes/actualPushedCounts > (1 + Threshold) * (ShouldPushBytes/ShouldPushCounts), we will enlarge the sort buffer size by 1X to try to buffer more data before pushing  (the max size of sortBuffer would be capped at # of partitions * CLIENT_PUSH_BUFFER_MAX_SIZE)

### Why are the changes needed?

to reduce perf cost in sortbased writer

### Does this PR introduce _any_ user-facing change?

no, but have 2 extra configurations

### How was this patch tested?

in prod of our company and also unit test

Closes #2358 from CodingCat/adaptive_memory_threshold.

Authored-by: CodingCat <zhunansjtu@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2024-03-13 13:54:12 +08:00
SteNicholas
0054930ce7
[CELEBORN-1323] Introduce ShutdownWorkerCount metric to record the count of workers in shutdown list
### What changes were proposed in this pull request?

Introduce `ShutdownWorkerCount` metric to record the count of workers in shutdown list.

<img width="1432" alt="image" src="https://github.com/apache/incubator-celeborn/assets/10048174/bc84b281-30ca-40a1-92e4-fb9cf10b5aeb">

### Why are the changes needed?

`/shutdownWorkers` lists all shutdown workers of the master at present. Therefore it's recommended to introduce ShutdownWorkerCount metric to record the count of workers in shutdown list.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- [Celeborn Dashboard](https://stenicholas.grafana.net/public-dashboards/c44822917403401690edb15617ec9f08)

Closes #2379 from SteNicholas/CELEBORN-1323.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-03-12 16:01:22 +08:00
SteNicholas
dee4afc580
[CELEBORN-1322] Rename LostWorkers metric to LostWorkerCount to align the naming style
### What changes were proposed in this pull request?

Rename `LostWorkers` metric to `LostWorkerCount` to align the naming style of other worker count metrics.

### Why are the changes needed?

The naming of `LostWorkers` metric is different from other metric of `MasterSource` like `WorkerCount`, `ExcludedWorkerCount` etc, which could be renamed to `LostWorkerCount` to align the naming style.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2378 from SteNicholas/CELEBORN-1322.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-03-11 20:41:22 +08:00
Chandni Singh
96df0d6e3c [CELEBORN-1179] Add support in Celeborn Workers to fetch application meta from the Master
### What changes were proposed in this pull request?
This enables a Celeborn Worker to retrieve the application meta from the Master if it hasn't received the secret from the Master before the application attempts to connect to it. Additionally, the Celeborn Worker's SecretRegistry has been converted into an LRU cache to prevent unbounded growth of the registry.

### Why are the changes needed?
This is last change needed for Auth support in Celeborn (https://issues.apache.org/jira/browse/CELEBORN-1011)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UTs and part of a bigger change which will be tested end-to-end.

Closes #2363 from otterc/CELEBORN-1179.

Authored-by: Chandni Singh <singh.chandni@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2024-03-11 14:15:08 +08:00
SteNicholas
36f2e03138
[MINOR] Fix style and Gluten link in Developers Doc
### What changes were proposed in this pull request?

Fix style and Gluten link in Developers Doc.

### Why are the changes needed?

- `slotsallocation.md` has the following wrong style:

<img width="1434" alt="image" src="https://github.com/apache/incubator-celeborn/assets/10048174/97fb53ed-473d-4f3d-8231-1fb613df9132">

- Gluten is apache incubating projetc, of which the link of Gluten project should be [Gluten](https://github.com/apache/incubator-gluten).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2375 from SteNicholas/developers-doc.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-03-11 12:07:01 +08:00
SteNicholas
07d342151c [CELEBORN-1284][FOLLOWUP] Fix license style of quota_management.md
### What changes were proposed in this pull request?

Fix license style of `quota_management.md`.

### Why are the changes needed?

The license style of `quota_management.md` is wrong.

<img width="1438" alt="image" src="https://github.com/apache/incubator-celeborn/assets/10048174/4a00724d-5fec-4b25-b134-d814c3152efd">

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2374 from SteNicholas/CELEBORN-1284.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-03-11 11:37:30 +08:00
Chandni Singh
6897b8be99 [CELEBORN-1234] Master should persist the application meta in Ratis and push it to the Workers
### What changes were proposed in this pull request?
This enables Celeborn Master to persist application meta in Ratis and also push it to Celeborn Workers when it receives the requests for slots from the LifecycleManager.

### Why are the changes needed?
This change is required for adding authentication. ([CELEBORN-1011](https://issues.apache.org/jira/browse/CELEBORN-1011)).

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added some UTs.

Closes #2346 from otterc/CELEBORN-1234.

Authored-by: Chandni Singh <singh.chandni@gmail.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-03-06 17:02:04 +08:00
SteNicholas
ce5386397d
[CELEBORN-1134][FOLLOWUP] Add execution.batch-shuffle-mode: ALL_EXCHANGES_BLOCKING to Flink Configuration of Deploy Flink client
### What changes were proposed in this pull request?

Add `execution.batch-shuffle-mode: ALL_EXCHANGES_BLOCKING` to `Flink Configuration` of `Deploy Flink client` in `deploy.md`

### Why are the changes needed?

Validation whether `execution.batch-shuffle-mode` is `ALL_EXCHANGES_BLOCKING` is supported in #2106. `Flink Configuration` of `Deploy Flink client` should also add this configuration.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2355 from SteNicholas/CELEBORN-1134.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-03-04 15:57:40 +08:00
SteNicholas
93d2b9f47f [CELEBORN-1298][FOLLOWUP] Support Spark2.4 with Scala2.12
### What changes were proposed in this pull request?

Support Spark2.4 with Scala2.12 in `sbt.md`. Meanwhile, the CI workflow adds the test for Spark2.4 and Scala2.12.

Follow up #2344.

### Why are the changes needed?

Spark2.4 with Scala2.12 is supported.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #2345 from SteNicholas/CELEBORN-1298.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-02-29 21:48:51 +08:00
Chandni Singh
d5a1bcdb6d [CELEBORN-1256] Added internal port and auth support to Celeborn worker
### What changes were proposed in this pull request?

This adds an internal port and auth support to Celeborn Wokers.
1. Internal port is used by a worker to receive messages from Celeborn Master.
2. Authentication support for secure communication with clients. This change doesn't add the support in clients to communicate to the Workers securely. That will be in a future change.

This change targets just adding the port and auth support to Worker. The following items from the proposal are still pending:

- Persisting the app secrets in Ratis.
- Forwarding secrets to Workers and having ability for the workers to pull registration info from the Master.
- Secured communication between workers and clients.

### Why are the changes needed?
It is needed for adding authentication support to Celeborn ([CELEBORN-1011](https://issues.apache.org/jira/browse/CELEBORN-1011))

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
Part of a bigger change. For this change, only modified existing UTs.

Closes #2292 from otterc/CELEBORN-1256.

Authored-by: Chandni Singh <singh.chandni@gmail.com>
Signed-off-by: waitinfuture <zky.zhoukeyong@alibaba-inc.com>
2024-02-29 10:09:22 +08:00