From 8fbcbead485ca5df8da103271a1f22b42451a325 Mon Sep 17 00:00:00 2001
From: SteNicholas
Date: Fri, 22 Mar 2024 16:34:25 +0800
Subject: [PATCH] [CELEBORN-1341][FOLLOWUP] Improve Celeborn document

### What changes were proposed in this pull request?

Improve the Celeborn documentation by fixing typos, formatting, invalid links, and default values that are out of sync. Also keep the public interfaces in `shuffleclient.md` consistent with `ShuffleClient`.

### Why are the changes needed?

The Celeborn documentation currently contains typos, formatting problems, invalid links, and out-of-sync default values. In addition, the public interfaces described in `shuffleclient.md` are inconsistent with `ShuffleClient`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2410 from SteNicholas/CELEBORN-1341.

Authored-by: SteNicholas
Signed-off-by: mingji
---
 docker/DEPLOY_ON_K8S.md          |  2 +-
 docs/README.md                   |  4 +++-
 docs/configuration/index.md      |  2 +-
 docs/deploy_on_k8s.md            |  6 +++---
 docs/developers/integrate.md     |  3 +--
 docs/developers/overview.md      |  2 +-
 docs/developers/shuffleclient.md | 10 ++++++++--
 docs/upgrading.md                |  4 ++--
 8 files changed, 20 insertions(+), 13 deletions(-)

diff --git a/docker/DEPLOY_ON_K8S.md b/docker/DEPLOY_ON_K8S.md
index acb074a24..6976301d7 100644
--- a/docker/DEPLOY_ON_K8S.md
+++ b/docker/DEPLOY_ON_K8S.md
@@ -46,7 +46,7 @@ For more information of Celeborn configurations, see [CONFIGURATIONS](../CONFIGU
 
 #### Install Celeborn
 ```
-helm install celeborn ${CELEBORN_HOME}/charts/celebron -n ${celeborn namespace}
+helm install celeborn ${CELEBORN_HOME}/charts/celeborn -n ${celeborn namespace}
 ```
 
 #### Connect to Celeborn in K8s pod
diff --git a/docs/README.md b/docs/README.md
index 613431984..24b3ffa6b 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -20,7 +20,7 @@ license: |
 ---
 Quick Start
 ===
-This documentation gives a quick start guide for running Apache Spark/Flink/MapReduce with Apache Celeborn™(Incubating).
+This documentation gives a quick start guide for running Spark/Flink/MapReduce with Apache Celeborn™(Incubating).
 
 ### Download Celeborn
 Download the latest Celeborn binary from the [Downloading Page](https://celeborn.apache.org/download/).
@@ -126,11 +126,13 @@ cp $CELEBORN_HOME/flink/ $FLINK_HOME/lib/
 ```
 #### Add Celeborn configuration to Flink's conf
 Set `shuffle-service-factory.class` to Celeborn's ShuffleServiceFactory in Flink configuration file:
+
 - Flink 1.14.x, Flink 1.15.x, Flink 1.17.x, Flink 1.18.x
 ```shell
 cd $FLINK_HOME
 vi conf/flink-conf.yaml
 ```
+
 - Flink 1.19.x
 ```shell
 cd $FLINK_HOME
diff --git a/docs/configuration/index.md b/docs/configuration/index.md
index 902c15b31..af8867080 100644
--- a/docs/configuration/index.md
+++ b/docs/configuration/index.md
@@ -189,7 +189,7 @@ Example can refer to [Hadoop Rack Awareness](https://hadoop.apache.org/docs/stab
 
 `ShuffleClient` records the shuffle partition location's host, service port, and filename,
 to support workers recovering reading existing shuffle data after worker restart,
-during worker shutdown, workers should store the meta about reading shuffle partition files in LevelDB,
+during worker shutdown, workers should store the meta about reading shuffle partition files in RocksDB or LevelDB(deprecated),
 and restore the meta after restarting workers, also workers should keep a stable service port to support
 `ShuffleClient` retry reading data. Users should set `celeborn.worker.graceful.shutdown.enabled` to `true` and
 set below service port with stable port to support worker recover status.
diff --git a/docs/deploy_on_k8s.md b/docs/deploy_on_k8s.md
index fa8ad91e6..16597d6ce 100644
--- a/docs/deploy_on_k8s.md
+++ b/docs/deploy_on_k8s.md
@@ -33,7 +33,7 @@ Celeborn currently supports rapid deployment by using helm.
 
 ### 1. Get Celeborn Binary Package
 
-You can find released version of Celeborn on https://celeborn.apache.org/download/.
+You can find released version of Celeborn on [Downloading Page](https://celeborn.apache.org/download/).
 
 Of course, you can build binary package from master branch or your own branch by using `./build/make-distribution.sh` in
 source code.
@@ -139,7 +139,7 @@ network infrastructure, this may cause pressure on DNS service or other network
 
 ### 6. Build Celeborn Client
 
-Here, without going into detail on how to configure spark/flink to find celeborn master/worker, mention the key
+Here, without going into detail on how to configure Spark/Flink/MapReduce to find celeborn master/worker, mention the key
 configuration:
 
 ```
@@ -149,5 +149,5 @@ spark.celeborn.master.endpoints: celeborn-master-0.celeborn-master-svc.
-> Notice: You should ensure that Spark/Flink can find the Celeborn Master/Worker via IP or the Kubernetes DNS mentioned
+> Notice: You should ensure that Spark/Flink/MapReduce can find the Celeborn Master/Worker via IP or the Kubernetes DNS mentioned
 > above
diff --git a/docs/developers/integrate.md b/docs/developers/integrate.md
index 22ec50df8..65f14a20c 100644
--- a/docs/developers/integrate.md
+++ b/docs/developers/integrate.md
@@ -20,8 +20,7 @@ license: |
 ## Overview
 The core components of Celeborn, i.e. `Master`, `Worker`, and `Client` are all engine irrelevant. Developers can
 integrate Celeborn with various engines or applications by using or extending Celeborn's `Client`, as the officially
-supported plugins for Apache Spark and Apache Flink, see [Spark Plugin](../../developers/spark) and
-[Flink Plugin](../../developers/flink).
+supported plugins for Spark/Flink/MapReduce.
 
 This article briefly describes an example of integrating Celeborn into a simple distributed application using
 Celeborn `Client`.
diff --git a/docs/developers/overview.md b/docs/developers/overview.md
index 493453a37..b9ff00431 100644
--- a/docs/developers/overview.md
+++ b/docs/developers/overview.md
@@ -104,7 +104,7 @@ When graceful shutdown is turned on, upon shutdown, Celeborn will do the followi
 2. Worker will inform Clients to split.
 3. Client will send `CommitFiles` to the Worker.
 
-Then the Worker waits until all `PartitionLocation` flushes data to persistent storage, stores states in local leveldb/rocksdb,
+Then the Worker waits until all `PartitionLocation` flushes data to persistent storage, stores states in local RocksDB or LevelDB(deprecated),
 then stops itself. The process is typically within one minute.
 
 For more details, please refer to [Rolling upgrade](../../upgrading/#rolling-upgrade)
diff --git a/docs/developers/shuffleclient.md b/docs/developers/shuffleclient.md
index b48179cad..02b75f5ff 100644
--- a/docs/developers/shuffleclient.md
+++ b/docs/developers/shuffleclient.md
@@ -124,19 +124,25 @@ to guarantee no data is lost.
 ```java
   public abstract CelebornInputStream readPartition(
       int shuffleId,
+      int appShuffleId,
       int partitionId,
       int attemptNumber,
       int startMapIndex,
-      int endMapIndex)
+      int endMapIndex,
+      ExceptionMaker exceptionMaker,
+      MetricsCallback metricsCallback)
 ```
 
-- `shuffleId` is the unique shuffle id of the application
+- `shuffleId` is the unique shuffle id of Celeborn
+- `appShuffleId` is the unique shuffle id of the application
 - `partitionId` is the partition id to read from
 - `attemptNumber` is the attempt id of reduce task, can be safely set to any value
 - `startMapIndex` is the index of start map index of interested map range, set to 0 if you want to read all
   partition data
 - `endMapIndex` is the index of end map index of interested map range, set to `Integer.MAX_VALUE` if you want to read
   all partition data
+- `exceptionMaker` is the marker of exception including fetch failure exception.
+- `metricsCallback` is the callback of monitoring metrics to increase read bytes and time etc.
 
 The returned input stream is guaranteed to be `Exactly Once`, meaning no data lost and no duplicated reading, or else
 an exception will be thrown, see [Here](../../developers/faulttolerant#exactly-once).
diff --git a/docs/upgrading.md b/docs/upgrading.md
index 9c7e6cb28..ab78ad109 100644
--- a/docs/upgrading.md
+++ b/docs/upgrading.md
@@ -50,7 +50,7 @@ Users can increase the configuration value appropriately according to the situat
 Shuffle client records the shuffle partition location's host, service port, and filename,
 to support workers recovering reading existing shuffle data after worker restart,
 during worker shutdown, workers should store the meta about reading shuffle partition files
-in LevelDB, and restore the meta after restarting workers.
+in RocksDB or LevelDB(deprecated), and restore the meta after restarting workers.
 Users should set `celeborn.worker.graceful.shutdown.enabled` to `true` to enable graceful shutdown.
 During this process, worker will wait all allocated partition's in this worker to be committed within a timeout
 of `celeborn.worker.graceful.shutdown.checkSlotsFinished.timeout`, which default value is `480s`.
@@ -70,7 +70,7 @@ In order to speed up the restart process, worker let all push data requests retu
 during worker shutdown, and shuffle client will re-apply for a new partition location for these allocated partitions.
 Then client side can record all HARD_SPLIT partition information and pre-commit these partition,
 then the worker side allocated partitions can be committed in a very short time. User should enable
-`celeborn.client.shuffle.batchHandleCommitPartition.enabled`, the default value is false.
+`celeborn.client.shuffle.batchHandleCommitPartition.enabled`, the default value is true.
 
 ### Example setting