[CELEBORN-1341][FOLLOWUP] Improve Celeborn document
### What changes were proposed in this pull request?

Improve the Celeborn documentation: fix typos, formatting, invalid links, and default values that were out of sync with the code. The public interfaces described in `shuffleclient.md` are also brought in line with `ShuffleClient`.

### Why are the changes needed?

The Celeborn documentation currently contains typos, formatting problems, invalid links, and out-of-sync default values. The public interfaces described in `shuffleclient.md` are also inconsistent with `ShuffleClient`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2410 from SteNicholas/CELEBORN-1341.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
parent d62f75fdc7
commit 8fbcbead48
@@ -46,7 +46,7 @@ For more information of Celeborn configurations, see [CONFIGURATIONS](../CONFIGU
 
 #### Install Celeborn
 ```
-helm install celeborn ${CELEBORN_HOME}/charts/celebron -n ${celeborn namespace}
+helm install celeborn ${CELEBORN_HOME}/charts/celeborn -n ${celeborn namespace}
 ```
 
 #### Connect to Celeborn in K8s pod
@@ -20,7 +20,7 @@ license: |
 ---
 Quick Start
 ===
-This documentation gives a quick start guide for running Apache Spark/Flink/MapReduce with Apache Celeborn™(Incubating).
+This documentation gives a quick start guide for running Spark/Flink/MapReduce with Apache Celeborn™(Incubating).
 
 ### Download Celeborn
 Download the latest Celeborn binary from the [Downloading Page](https://celeborn.apache.org/download/).
@@ -126,11 +126,13 @@ cp $CELEBORN_HOME/flink/<Celeborn Client Jar> $FLINK_HOME/lib/
 ```
 #### Add Celeborn configuration to Flink's conf
 Set `shuffle-service-factory.class` to Celeborn's ShuffleServiceFactory in Flink configuration file:
+
 - Flink 1.14.x, Flink 1.15.x, Flink 1.17.x, Flink 1.18.x
 ```shell
 cd $FLINK_HOME
 vi conf/flink-conf.yaml
 ```
+
 - Flink 1.19.x
 ```shell
 cd $FLINK_HOME
@@ -189,7 +189,7 @@ Example can refer to [Hadoop Rack Awareness](https://hadoop.apache.org/docs/stab
 
 `ShuffleClient` records the shuffle partition location's host, service port, and filename,
 to support workers recovering reading existing shuffle data after worker restart,
-during worker shutdown, workers should store the meta about reading shuffle partition files in LevelDB,
+during worker shutdown, workers should store the meta about reading shuffle partition files in RocksDB or LevelDB(deprecated),
 and restore the meta after restarting workers, also workers should keep a stable service port to support
 `ShuffleClient` retry reading data. Users should set `celeborn.worker.graceful.shutdown.enabled` to `true` and
 set below service port with stable port to support worker recover status.
@@ -33,7 +33,7 @@ Celeborn currently supports rapid deployment by using helm.
 
 ### 1. Get Celeborn Binary Package
 
-You can find released version of Celeborn on https://celeborn.apache.org/download/.
+You can find released version of Celeborn on [Downloading Page](https://celeborn.apache.org/download/).
 
 Of course, you can build binary package from master branch or your own branch by using `./build/make-distribution.sh` in
 source code.
@@ -139,7 +139,7 @@ network infrastructure, this may cause pressure on DNS service or other network
 
 ### 6. Build Celeborn Client
 
-Here, without going into detail on how to configure spark/flink to find celeborn master/worker, mention the key
+Here, without going into detail on how to configure Spark/Flink/MapReduce to find celeborn master/worker, mention the key
 configuration:
 
 ```
@@ -149,5 +149,5 @@ spark.celeborn.master.endpoints: celeborn-master-0.celeborn-master-svc.<namespac
 You can find why config endpoints such way
 in [Kubernetes DNS for Service And Pods](https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/)
 
-> Notice: You should ensure that Spark/Flink can find the Celeborn Master/Worker via IP or the Kubernetes DNS mentioned
+> Notice: You should ensure that Spark/Flink/MapReduce can find the Celeborn Master/Worker via IP or the Kubernetes DNS mentioned
 > above
@@ -20,8 +20,7 @@ license: |
 ## Overview
 The core components of Celeborn, i.e. `Master`, `Worker`, and `Client` are all engine irrelevant. Developers can
 integrate Celeborn with various engines or applications by using or extending Celeborn's `Client`, as the officially
-supported plugins for Apache Spark and Apache Flink, see [Spark Plugin](../../developers/spark) and
-[Flink Plugin](../../developers/flink).
+supported plugins for Spark/Flink/MapReduce.
 
 This article briefly describes an example of integrating Celeborn into a simple distributed application using
 Celeborn `Client`.
@@ -104,7 +104,7 @@ When graceful shutdown is turned on, upon shutdown, Celeborn will do the followi
 2. Worker will inform Clients to split.
 3. Client will send `CommitFiles` to the Worker.
 
-Then the Worker waits until all `PartitionLocation` flushes data to persistent storage, stores states in local leveldb/rocksdb,
+Then the Worker waits until all `PartitionLocation` flushes data to persistent storage, stores states in local RocksDB or LevelDB(deprecated),
 then stops itself. The process is typically within one minute.
 
 For more details, please refer to [Rolling upgrade](../../upgrading/#rolling-upgrade)
@@ -124,19 +124,25 @@ to guarantee no data is lost.
 ```java
 public abstract CelebornInputStream readPartition(
     int shuffleId,
+    int appShuffleId,
     int partitionId,
     int attemptNumber,
     int startMapIndex,
-    int endMapIndex)
+    int endMapIndex,
+    ExceptionMaker exceptionMaker,
+    MetricsCallback metricsCallback)
 ```
 
-- `shuffleId` is the unique shuffle id of the application
+- `shuffleId` is the unique shuffle id of Celeborn
+- `appShuffleId` is the unique shuffle id of the application
 - `partitionId` is the partition id to read from
 - `attemptNumber` is the attempt id of reduce task, can be safely set to any value
 - `startMapIndex` is the index of start map index of interested map range, set to 0 if you want to read all
   partition data
 - `endMapIndex` is the index of end map index of interested map range, set to `Integer.MAX_VALUE` if you want
   to read all partition data
+- `exceptionMaker` is the marker of exception including fetch failure exception.
+- `metricsCallback` is the callback of monitoring metrics to increase read bytes and time etc.
 
 The returned input stream is guaranteed to be `Exactly Once`, meaning no data lost and no duplicated reading, or else
 an exception will be thrown, see [Here](../../developers/faulttolerant#exactly-once).
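
For illustration only, here is a minimal, self-contained sketch of how the `startMapIndex` and `endMapIndex` arguments documented in the hunk above select a map range. It does not call Celeborn's `ShuffleClient`: `ReadRangeSketch`, `MapOutput`, and the in-memory list are hypothetical stand-ins, and treating the range as half-open (`startMapIndex` inclusive, `endMapIndex` exclusive) is an assumption that the documented text does not state.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a reader; NOT Celeborn's ShuffleClient.
final class ReadRangeSketch {

    /** Hypothetical record of one map task's output within a partition. */
    static final class MapOutput {
        final int mapIndex;
        final String payload;

        MapOutput(int mapIndex, String payload) {
            this.mapIndex = mapIndex;
            this.payload = payload;
        }
    }

    /**
     * Returns the payloads whose map index falls in the requested range.
     * Assumption: the range is [startMapIndex, endMapIndex).
     */
    static List<String> readPartition(
            List<MapOutput> partitionData, int startMapIndex, int endMapIndex) {
        List<String> selected = new ArrayList<>();
        for (MapOutput output : partitionData) {
            if (output.mapIndex >= startMapIndex && output.mapIndex < endMapIndex) {
                selected.add(output.payload);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<MapOutput> data = Arrays.asList(
                new MapOutput(0, "a"), new MapOutput(1, "b"), new MapOutput(2, "c"));

        // 0 and Integer.MAX_VALUE select the whole partition, as the doc describes.
        System.out.println(readPartition(data, 0, Integer.MAX_VALUE));

        // A narrower range selects only the map outputs inside it.
        System.out.println(readPartition(data, 1, 2));
    }
}
```

The sketch only mirrors the documented convention that `0` and `Integer.MAX_VALUE` mean "read all partition data"; the real method returns a `CelebornInputStream` and takes the additional parameters listed in the signature.
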
@@ -50,7 +50,7 @@ Users can increase the configuration value appropriately according to the situat
 Shuffle client records the shuffle partition location's host, service port, and filename,
 to support workers recovering reading existing shuffle data after worker restart,
 during worker shutdown, workers should store the meta about reading shuffle partition files
-in LevelDB, and restore the meta after restarting workers.
+in RocksDB or LevelDB(deprecated), and restore the meta after restarting workers.
 Users should set `celeborn.worker.graceful.shutdown.enabled` to `true` to enable graceful shutdown.
 During this process, worker will wait all allocated partition's in this worker to be committed
 within a timeout of `celeborn.worker.graceful.shutdown.checkSlotsFinished.timeout`, which default value is `480s`.
@@ -70,7 +70,7 @@ In order to speed up the restart process, worker let all push data requests retu
 during worker shutdown, and shuffle client will re-apply for a new partition location for these allocated partitions.
 Then client side can record all HARD_SPLIT partition information and pre-commit these partition,
 then the worker side allocated partitions can be committed in a very short time. User should enable
-`celeborn.client.shuffle.batchHandleCommitPartition.enabled`, the default value is false.
+`celeborn.client.shuffle.batchHandleCommitPartition.enabled`, the default value is true.
 
 ### Example setting
 
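
As a rough sketch, the settings named in the two hunks above can be collected as plain key/value pairs. The three configuration keys and the `480s` and `true` defaults come from the documented text; the class `GracefulShutdownConfSketch`, its methods, and the use of `java.util.Properties` are hypothetical choices made for illustration and are not how Celeborn itself loads configuration.

```java
import java.util.Properties;

// Hypothetical holder for the documented settings; not a Celeborn class.
final class GracefulShutdownConfSketch {

    static final String ENABLED_KEY =
            "celeborn.worker.graceful.shutdown.enabled";
    static final String CHECK_SLOTS_TIMEOUT_KEY =
            "celeborn.worker.graceful.shutdown.checkSlotsFinished.timeout";
    static final String BATCH_COMMIT_KEY =
            "celeborn.client.shuffle.batchHandleCommitPartition.enabled";

    /** Worker-side settings: enable graceful shutdown, keep the documented timeout. */
    static Properties workerConf() {
        Properties conf = new Properties();
        conf.setProperty(ENABLED_KEY, "true");
        conf.setProperty(CHECK_SLOTS_TIMEOUT_KEY, "480s"); // documented default
        return conf;
    }

    /** Client-side setting; the updated doc states its default is true. */
    static Properties clientConf() {
        Properties conf = new Properties();
        conf.setProperty(BATCH_COMMIT_KEY, "true");
        return conf;
    }

    /** Hypothetical helper: parses a value of the form "<n>s" into seconds. */
    static long timeoutSeconds(Properties conf) {
        String raw = conf.getProperty(CHECK_SLOTS_TIMEOUT_KEY, "480s").trim();
        if (!raw.endsWith("s")) {
            throw new IllegalArgumentException("expected a value like 480s, got: " + raw);
        }
        return Long.parseLong(raw.substring(0, raw.length() - 1));
    }

    public static void main(String[] args) {
        Properties worker = workerConf();
        System.out.println(ENABLED_KEY + " = " + worker.getProperty(ENABLED_KEY));
        System.out.println("timeout seconds = " + timeoutSeconds(worker));
        System.out.println(BATCH_COMMIT_KEY + " = " + clientConf().getProperty(BATCH_COMMIT_KEY));
    }
}
```

The worker and client settings are kept in separate objects because the documented text attributes the first two keys to the worker and the third to the client.
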